Abstract



Statistical methods for automatically extracting information about associations between words or 
documents from large collections of text have the potential to have considerable impact in a number of 
areas, such as information retrieval and natural-language-based user interfaces. However, even huge 
bodies of text yield highly unreliable estimates of the probability of relatively common events, and, 
in fact, perfectly reasonable events may not occur in the training data at all. This is known as the 
sparse data problem. Traditional approaches to the sparse data problem use crude approximations. 
We propose a different solution: if we are able to organize the data into classes of similar events, 
then, if information about an event is lacking, we can estimate its behavior from information about 
similar events. This thesis presents two such similarity-based approaches, where, in general, we 
measure similarity by the Kullback-Leibler divergence, an information-theoretic quantity. 

Our first approach is to build soft, hierarchical clusters: soft, because each event belongs to each 
cluster with some probability; hierarchical, because cluster centroids are iteratively split to model 
finer distinctions. Our clustering method, which uses the technique of deterministic annealing, 
represents (to our knowledge) the first application of soft clustering to problems in natural language 
processing. We use this method to cluster words drawn from 44 million words of Associated Press 
Newswire and 10 million words from Grolier's encyclopedia, and find that language models built 
from the clusters have substantial predictive power. Our algorithm also extends with no modification 
to other domains, such as document clustering. 

Our second approach is a nearest-neighbor approach: instead of calculating a centroid for each 
class, we in essence build a cluster around each word. We compare several such nearest-neighbor 
approaches on a word sense disambiguation task and find that as a whole, their performance is far 
superior to that of standard methods. In another set of experiments, we show that using estimation 
techniques based on the nearest-neighbor model enables us to achieve perplexity reductions of more
than 20 percent over standard techniques in the prediction of low-frequency events, and statistically 
significant speech recognition error-rate reduction. 
Chapter 1

Introduction 



"You shall know a word by the company it keeps!" (Firth, 1957, pg. 11) 



We begin by considering the problem of predicting string probabilities. Suppose we are presented 
with two strings, 

1. "Grill doctoral candidates", and 

2. "Grill doctoral updates",

and are asked to determine which string is more likely. Notice that this is not the same question as 
asking which of these strings is grammatical. In fact, both constitute legitimate English sentences. 
The first sentence is a command to ask a graduating Ph.D. student many difficult questions; the
second might be an order to take lists of people who have just received doctorates and throw them^1
on a Hibachi.

Methods for assigning probabilities to strings are called language models. In this thesis, we will 
abuse the term somewhat and refer to methods that assign probabilities to word associations as 
language models, too. That is, we will consider methods which estimate the probability of word 
cooccurrence relations; these methods need not be defined on sentences. For example, in chapters 
^ and ^ we will be concerned with the problem of estimating the probability that a noun x and a 
transitive verb y appear in a sentence with x being the head noun of the direct object of y. 

One important application of language modeling is error correction. Current speech recognizers 
do not achieve perfect recognition rates, and it is easy to imagine a situation in which a speech 
recognizer cannot decide whether a speaker said "Grill doctoral candidates" or "Grill doctoral
updates". A language model can provide a speech recognizer with the information that the former
sentence is more likely than the latter; this information would help the recognizer make the right 
choice. Similar situations arise in handwriting recognition, spelling correction, optical character 
recognition, and so on — whenever the physical evidence itself may not be enough to determine the 
corresponding string. 

More formally, let E be some physical evidence, and suppose we wish to know whether the string 
W is the message conveyed or encoded by E. Using Bayes' rule, we can combine the estimate 
$P(E|W)$ given by an acoustic model with the probability $P_{LM}(W)$ assigned by a language model to
find the posterior probability that W is the true string given the evidence at hand:

$$P(W|E) = \frac{P(E|W)\, P_{LM}(W)}{P(E)} \qquad (1.1)$$

(since the evidence E is fixed, it is the same for every hypothesized string W, so the $P(E)$ term is
generally ignored in practice). Thus, in a situation where two hypothesized strings cannot be
distinguished on the basis of the physical evidence alone, a language model can provide the information
necessary for disambiguation.

^1 (the lists)

Another application of language modeling is machine translation. Suppose one needs to translate 
the phrase "Grill doctoral candidates" to another language. Two possible target sentences are "Ask 
applicants many questions" and "Roast applicants on a spit". If we have a language model that
furnishes us the information that the first sentence is more likely than the second, then (in the 
absence of context providing evidence to the contrary) we would pick the first sentence as the 
correct translation. 

This thesis is concerned with statistical approaches to problems in natural language processing. 
Typically, statistical approaches take as input some large sample of text, which may or may not be 
annotated in some fashion, and attempt to learn characteristics of the language from the statistics 
in the sample. They may also make use of auxiliary information gained from such sources as on-line
dictionaries or WordNet (Miller, 1995). An important advantage of statistical approaches over
traditional linguistic models is that statistical methods yield probabilities. These probabilities can
easily be combined with estimates from other components, as in equation (1.1) above. Traditional
linguistic models, on the other hand, only describe whether or not a string is grammatical. This
information is too coarse-grained for use in practical tasks; for instance, both "Grill doctoral
candidates" and "Grill doctoral updates" are valid sentences, and yet we know that the first string is
far more likely than the second.

Perhaps the simplest statistical approach to language modeling is the maximum likelihood estimate
(MLE), which simply counts the number of times that the string of interest occurs in the training
sample S and normalizes by the sample size. For "Grill doctoral candidates", this estimate takes the
form

$$P_{MLE}(\text{"Grill doctoral candidates"}) = \frac{C(\text{"Grill doctoral candidates"})}{|S|}, \qquad (1.2)$$

where C("Grill doctoral candidates") is the number of times the phrase occurred in S. The quantity
$|S|$ might be the number of word triples in S, or the number of sentences in S, or some other relevant
measure.

Notice that if the event of interest is unseen, that is, does not occur in S, then the maximum
likelihood estimate assigns it a probability of zero. In terms of practicality, this turns out to be 
a fatal flaw because of the sparse data problem: even if S is quite big, a large number of possible 
events will not appear in S. Assigning all unseen events a probability of zero, as the MLE does, 
amounts to declaring many perfectly reasonable strings to have zero probability of occurring, which 
is clearly unsatisfactory. 
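
To make the zero-probability behavior concrete, here is a minimal sketch of the MLE for word triples. The toy sample, the function name `trigram_mle`, and the choice of |S| as the number of word triples are illustrative assumptions, not part of the original experiments.

```python
from collections import Counter

def trigram_mle(tokens):
    """Return a function that gives the MLE probability of a word triple."""
    triples = list(zip(tokens, tokens[1:], tokens[2:]))
    counts = Counter(triples)
    total = len(triples)  # |S| is taken here to be the number of word triples

    def prob(triple):
        # Counter returns 0 for triples that never occurred in the sample
        return counts[tuple(triple)] / total

    return prob

# Hypothetical toy sample standing in for the training text S
sample = "grill the candidates then grill the steaks".split()
p_mle = trigram_mle(sample)
seen = p_mle(("grill", "the", "candidates"))
unseen = p_mle(("grill", "doctoral", "candidates"))
print(seen, unseen)
```

The unseen triple receives probability exactly zero, however plausible it is, which is the flaw described above.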

To illustrate the pernicious nature of the sparse data problem, we present the following example. 
Consider the set S to be the text contained in all the pages indexed by AltaVista, Digital's web
search engine (Digital Equipment Corporation, 1997). Currently, this set consists of 31 million web
pages, which, at an extremely conservative estimate, means that S contains at least a billion words.
Yet at the time of this writing, the phrase "Grill doctoral candidates" does not occur at all among 
those billion words, so the MLE would rule out this sentence as absolutely impossible. 

Although the sparse data problem affects low-frequency events, it is incorrect to infer that it 
therefore is not important. One might attempt to claim that if an event has such a low probability 
that it does not occur in a very large sample, then actually estimating its probability to be zero will 
not be a major error. However, the aggregate probability of unseen events can be a big percentage 
of the test data, which means that it is quite important to treat unseen events carefully. Brown et
al. (1992), for instance, studied a 350 million word sample of English text, and estimated that in
any new sample drawn from the same source distribution, 14% of the trigrams (sequences of three
consecutive words) would not have occurred in the large text. A speech recognizer that refused to 
accept 1 out of every 7 sentences would be completely unusable. 

As a historical aside, we observe that Noam Chomsky famously declared that sparse data problems
are insurmountable.

It is fair to assume that neither sentence (1) [Colorless green ideas sleep furiously] nor (2)
[Furiously sleep ideas green colorless] ... has ever occurred .... Hence, in any statistical
model ... these sentences will be ruled out on identical grounds as equally 'remote' from
English. Yet (1), though nonsensical, is grammatical, while (2) is not.^2 (Chomsky,
1964, pg. 16)

This thought experiment helped "[disabuse] the field once and for all of the notion that there was
anything of interest to statistical models of language" (Abney, 1996).

However, in the years since Chomsky wrote this remark, some progress on ameliorating the sparse 
data problem has been made. Indeed, Chomsky's statement is based on the false assumption that 
"any statistical model" must be based on the maximum likelihood estimate. This is certainly not 
the case. Two standard language modeling techniques used in speech recognition, Jelinek-Mercer 
smoothing and Katz back-off smoothing, make use of an estimator guaranteed to be non-zero. In 
the case where the probability of an unseen word pair $(w_1, w_2)$ is being estimated, these methods
incorporate the probability of word $w_2$ (details can be found in section 2.2). But this is not always
adequate: for example, the word "updates" appears on more web pages indexed by AltaVista than 
"candidates" does. 

The key idea in this thesis is that we can use similarity information to make more sophisticated 
probability estimates when sparse data problems occur. This idea is intuitively appealing, for if we 
know that the word "candidates" is somehow similar to the word "nominees", then the occurrence
of the sentence "Grill doctoral nominees" would lead us to believe that "Grill doctoral candidates" 
is also likely. 

The notion of similarity we explore is that of distributional similarity, since we will represent 
words as distributions over the contexts in which they occur (as implied by the quotation that opens 
this chapter). We will thus be concerned with measures of the "distance" between probability mass
functions. We discuss several such measures in chapter 2, but our main focus will be on using the
Kullback-Leibler divergence, an information-theoretic quantity.

The work presented in this thesis can be divided into two parts. The first is the development 
of a distributional clustering method for grouping similar words. This method builds probabilistic, 
hierarchical clusters: objects belong to each cluster with some probability, and clusters are broken 
up into subclusters so that a hierarchy results. We derive the method, exhibit clusters found by our 
method in order to provide a qualitative sense of how the method performs, and show that effective 
language models can be constructed from the clusters produced by our method. To our knowledge, 
this is the first probabilistic clustering method to be applied to natural language processing. 

The second part is the development of a more computationally efficient way to incorporate
similarity information: a nearest-neighbor (or "most similar neighbor") language model that
combines estimates from specific objects rather than from classes. We compare several different 
implementations of this type of model against standard smoothing methods and find that using 
similarity information leads to far better estimates. 

This thesis is organized as follows. Chapter 2 describes many of the theoretical results employed in
later chapters. We discuss standard language modeling techniques and study the properties of several
distributional similarity functions. Chapter 3 presents our distributional clustering method. Chapter
4 develops the nearest-neighbor approach and compares the performance of several implementations
on a pseudo-word-disambiguation task. Chapter 5 considers an extension of our nearest-neighbor
approach and studies its performance on more realistic tasks. We conclude with a brief summary of
the thesis and indicate directions for further work in chapter 6.



^2 Ironically, this remark is now so well known that it has become false: use of AltaVista reveals that at the time of
this writing, 40 web pages contain the first sentence, whereas only three contain the second.
Chapter 2

Distributional Similarity 



This chapter presents background material underlying the work in this thesis. In section 2.1 we argue
that representing objects as distributions is natural and useful. Section 2.2 reviews common methods 
for estimating distributions from a sample. We use these methods to provide initial distributions to 
our algorithms and also as standards against which to compare the performance of our similarity- 
based estimates. Section 2.3 studies various functions measuring similarity between distributions. 
We pay particular attention to the Kullback-Leibler divergence (Cover and Thomas, 1991), which
plays a central role in our work. 



2.1 Objects as Distributions 

The first issue we must address is what representation to use for the objects we wish to cluster and
compare. For the moment, we will be vague about what sorts of objects we will be considering;
researchers have clustered everything from documents (Salton, 1968; Cutting et al., 1992) to irises
(Fisher, 1936; Cheeseman et al., 1988).



We want the representation we choose to satisfy two requirements. First, the representation 
should be general enough to apply to many different types of objects. Second, any particular 
object's representation should be easy to calculate from samples alone; we do not want to use 
outside sources of information such as on-line dictionaries. This second condition expresses our 
preference for algorithms that are adaptable; if we rely on knowledge that is hard for computers to 
derive from training data, then we cannot use our algorithms on new domains without expending 
considerable effort on re-acquiring the requisite knowledge. Furthermore, large samples that have
few or no annotation^1 are far more common and readily obtainable than large highly-annotated
samples, and thus working with representations adhering to the second condition tends to be much 
more convenient. 



Many clustering schemes represent objects in terms of a set $\{A_1, A_2, \ldots, A_N\}$ of attributes
(Kaufman and Rousseeuw, 1990). Each object is associated with an attribute vector
$(a_1, a_2, \ldots, a_N)$ of
values for the attributes. Some attributes can take on an infinite number of values; for example, 
the mean of a normal distribution can be any real number. Other attributes, such as the sex of 
a patient, range only over a finite set. Usually, no assumptions are made about the relationship 
between different attributes. 

In this thesis, we use a restricted version of the attribute representation. Objects are equated
with probability mass functions, or distributions: each attribute must have a nonnegative real
value, and we require that all object attribute vectors $(a_1, a_2, \ldots, a_N)$ satisfy the constraint that
$\sum_j a_j = 1$. We can think of $a_i$ as the probability that the object assigns to $A_i$. This distributional
representation for objects is particularly appropriate for situations arising in unsupervised learning,
where a learning algorithm must infer properties of events from a sample of unannotated data. In
such situations, we can define the attributes to be the contexts in which events can occur; the value
$a_i$ for a particular event is then the proportion of the time the event occurred in context i.

^1 In the following chapters, we use either unannotated data or data that has been tagged with parts of speech.

For example, suppose we wish to learn about word usage from the following (small) sample of 
English text: "A rose is a rose is not a nose". Our events are therefore words. If we define the 
context of a word to be the following word, then the possible contexts are "a", "is", "nose", "not", 
and "rose". The attribute vector for the word "a" is (0, 0, 1/3, 0, 2/3), since "a" occurs before "nose" 
(the third attribute) one out of three times, and before "rose" (the last attribute) twice. 
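
The example above can be reproduced with a few lines of code. This is only a sketch: the function name `next_word_distributions` is hypothetical, and lowercasing the text and ordering the contexts alphabetically are assumptions made to match the attribute order used in the example.

```python
from collections import Counter, defaultdict
from fractions import Fraction

def next_word_distributions(text):
    """Represent each word as a distribution over the words that follow it."""
    words = text.lower().split()
    # Possible contexts: every word that appears immediately after some word
    contexts = sorted(set(words[1:]))
    pair_counts = defaultdict(Counter)
    for word, following in zip(words, words[1:]):
        pair_counts[word][following] += 1
    vectors = {}
    for word, ctr in pair_counts.items():
        total = sum(ctr.values())
        # Proportion of the time the word occurred in each context
        vectors[word] = tuple(Fraction(ctr[c], total) for c in contexts)
    return contexts, vectors

contexts, vectors = next_word_distributions("A rose is a rose is not a nose")
print(contexts)
print(vectors["a"])
```

Exact fractions are used so that the vector for "a" matches the one given in the text; each vector sums to one by construction.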

In accordance with our first requirement for representations, the distributional representation is 
fairly general. For instance, we have demonstrated that words can be represented as distributions 
over subsequent words, and we can just as easily represent documents as distributions over the words 
that occur in them, or customers as distributions over the products they buy, and so on. Indeed, 
it is a reasonable representation whenever the data consists of a set of events (e.g., words occurring 
together) rather than measurements or properties (e.g., a list of each word's part of speech). Also, 
in compliance with our second requirement for distributions, the distributional representation for 
any object is trivial to calculate as long as the contexts are easily recognizable. Furthermore, we 
wish to apply our techniques to language modeling, a task for which probability distributions must 
be produced. Finally, the constraint that the components of attribute vectors sum to unity is of use 
to us in our calculations, as will be seen in chapter 0. 



2.2 Initial Estimates for Distributions 

The remainder of this thesis will be concerned with object distributions that have been estimated 
from object-context pairs. More formally, let $\mathcal{X}$ be the set of objects under consideration and
$\mathcal{Y}$ be the set of possible contexts, $\mathcal{Y} = \{y_1, y_2, \ldots, y_N\}$. Assume that the data consists of
pairs $(x, y) \in \mathcal{X} \times \mathcal{Y}$ along with counts $C(x, y)$ of how many times $(x, y)$ occurred in some
training sample. Counts for individual objects and contexts are readily obtained from counts for the
pairs: $C(x) = \sum_y C(x, y)$ and $C(y) = \sum_x C(x, y)$; without loss of generality, assume that every
object and every context occurs at least once. We wish to represent object x by the conditional
distribution $P(y|x)$ for all $y \in \mathcal{Y}$. This distribution must be estimated from the data pairs. Of
course, the goal of this thesis is to develop good estimates for $P(y|x)$, but we need some initial
distributions to start with.

A particularly simple estimation method is the maximum likelihood estimate (MLE) $P_{MLE}(y|x)$:

$$P_{MLE}(y|x) = \frac{C(x, y)}{C(x)}. \qquad (2.1)$$

Notice that if the joint event $(x, y)$ never occurs, then $P_{MLE}(y|x) = 0$, which is equivalent to saying
that any event that does not occur in the training sample is impossible. As noted in chapter 1, using
the maximum likelihood estimate tends to grossly underestimate the probability of low-frequency
events.



Many alternatives to the MLE (Good, 1953; Jelinek and Mercer, 1980; Katz, 1987; Church
and Gale, 1991) take the MLE as an initial estimate and adjust it so that the total estimated
probability of pairs occurring in the sample is less than one, leaving some probability mass for 
unseen pairs. These techniques are known as smoothing methods, since they "smooth over" zeroes 
in distributions. Typically, the adjustment involves either interpolation, in which the new estimator 
is a weighted combination of the MLE and an estimator guaranteed to be nonzero for unseen pairs, 
or discounting, in which the MLE is decreased to create some leftover probability mass for unseen 
pairs. 



The work of Jelinek and Mercer (1980) is the classic interpolation method. They produce an
estimate by linearly interpolating the MLE for the conditional probability of an object-context
pair $(x, y)$ with the maximum likelihood estimate $P_{MLE}(y) = C(y) / \sum_{y'} C(y')$ for the probability of
context y:

$$P_{JM}(y|x) = \lambda(x) P_{MLE}(y|x) + (1 - \lambda(x)) P_{MLE}(y). \qquad (2.2)$$

The function $\lambda(x)$ ranges between 0 and 1, and reflects our confidence in the available data regarding
x. If x occurs relatively frequently, then we have reason to believe that the MLE for the pair $(x, y)$ is
reliable. We then give $\lambda(x)$ a high value, so that $P_{JM}$ depends mostly on $P_{MLE}(y|x)$. On the other
hand, if x is relatively rare, then $P_{MLE}(y|x)$ is unlikely to be very accurate. In this case, we decide
to rely more on $P_{MLE}(y)$, since counts for the single event y are higher than counts for the joint
event $(x, y)$. We therefore set $\lambda(x)$ to a relatively low value. A method for training $\lambda$ is described
by Bahl, Jelinek, and Mercer (1983).
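
As a rough illustration of equation (2.2), the sketch below interpolates the two maximum likelihood estimates. The toy counts and the constant weight 0.75 are assumptions made for the example; in the method described above the weight depends on how often x occurs and is trained from data.

```python
from collections import Counter

def jelinek_mercer(pair_counts, lam):
    """Build P_JM(y|x) from counts C(x, y) and a weight function lam(x) in [0, 1]."""
    c_x, c_y = Counter(), Counter()
    for (x, y), c in pair_counts.items():
        c_x[x] += c
        c_y[y] += c
    total = sum(c_y.values())

    def p_jm(y, x):
        p_pair = pair_counts[(x, y)] / c_x[x]  # P_MLE(y|x), zero for unseen pairs
        p_context = c_y[y] / total             # P_MLE(y), nonzero for every context
        return lam(x) * p_pair + (1 - lam(x)) * p_context

    return p_jm

# Hypothetical toy counts for (object, context) pairs
counts = Counter({("grill", "steaks"): 3, ("grill", "candidates"): 1,
                  ("post", "updates"): 4})
p_jm = jelinek_mercer(counts, lam=lambda x: 0.75)  # constant weight, illustration only
unseen_prob = p_jm("updates", "grill")
row_sum = sum(p_jm(y, "grill") for y in ("steaks", "candidates", "updates"))
print(unseen_prob, row_sum)
```

The unseen pair ("grill", "updates") now receives nonzero probability, and the estimates for a fixed x still sum to one.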



A popular alternative in the speech recognition literature is the back-off discounting method of
Katz (1987). It provides a clear separation between frequent events, for which observed frequencies
are reliable probability estimators, and low-frequency events, whose prediction must involve
additional information sources. Furthermore, the back-off model does not require complex estimation
calculations for interpolation parameters such as $\lambda(x)$, as is the case for Jelinek and Mercer's method.
Katz first uses the Good-Turing formula (Good, 1953) to replace the actual frequency $C(x, y)$
of an object-context pair with a discounted frequency $C^*(x, y)$. Let $n_m$ denote the number of pairs
that occurred m times in the sample. The Good-Turing estimate then defines $C^*(x, y)$ as

$$C^*(x, y) = (C(x, y) + 1) \frac{n_{C(x,y)+1}}{n_{C(x,y)}}.$$

This discounted frequency is used in the same way the true frequency is used in the MLE (equation
(2.1)):

$$P_d(y|x) = \frac{C^*(x, y)}{C(x)}.$$

As a consequence, the estimated conditional probability of an unseen pair $(x', y')$ is

$$P_d(y'|x') = \frac{n_1/n_0}{C(x')}.$$

Thus, the probability mass assigned to unseen pairs involving object $x'$ is distributed uniformly. The
total mass assigned to unseen pairs involving $x'$ is simply the complement of the mass assigned to
seen pairs involving $x'$:

$$\beta(x') = 1 - \sum_{y : C(x', y) > 0} P_d(y|x').$$



For more details, see Nadas (1985), who presents three different derivations (two empirical-Bayesian
and one empirical) of the Good-Turing estimate.
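
The Good-Turing discounted frequency can be sketched as follows. The helper names and the toy counts are hypothetical, and the number of unseen pairs is simply passed in as an assumed known quantity. Note that this plain form of the formula yields zero whenever no pair occurred exactly one more time than the count being discounted, which is one reason discounting is normally applied only to small counts.

```python
from collections import Counter

def count_of_counts(pair_counts, num_unseen):
    """n[m] = number of pairs that occurred exactly m times in the sample."""
    n = Counter(pair_counts.values())
    n[0] = num_unseen  # assumed known: number of possible pairs never observed
    return n

def discounted_count(c, n):
    """Good-Turing discounted frequency C* for a pair observed c times."""
    return (c + 1) * n[c + 1] / n[c]

# Hypothetical toy counts: three pairs seen once, two seen twice, one seen three times
pair_counts = {("a", "u"): 1, ("a", "v"): 1, ("b", "u"): 1,
               ("b", "v"): 2, ("c", "u"): 2, ("c", "v"): 3}
n = count_of_counts(pair_counts, num_unseen=6)
c_star_unseen = discounted_count(0, n)  # n_1 / n_0
c_star_once = discounted_count(1, n)    # 2 * n_2 / n_1
print(c_star_unseen, c_star_once)
```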

Katz alters the Good-Turing treatment by not using $P_d$ for unseen pairs. Rather, he bases his
estimate of the conditional probability of an unseen pair $(x', y')$ on an estimate of the probability
of $y'$. This amounts to assuming that the behavior of $(x', y')$ is independent of the behavior of $x'$;
Jelinek and Mercer (1980) make a similar assumption when they set $\lambda$ to a low value in equation
(2.2).



More formally, we write the estimate for an arbitrary pair $(x, y)$ in the following form, which is not
Katz's original presentation but will be convenient for us in chapters 4 and 5 (note the asymmetrical
treatment of seen and unseen pairs):

$$\hat{P}(y|x) = \begin{cases} P_d(y|x) & \text{if } C(x, y) > 0 \\ \alpha(x) P_r(y|x) & \text{otherwise ($(x, y)$ is unseen)} \end{cases} \qquad (2.3)$$

$P_r$ is the model for probability redistribution among unseen pairs. Katz (implicitly) defines $P_r$ as
the probability of the context:

$$P_r(y|x) = P(y),$$

so that

$$P_{BO}(y|x) = \begin{cases} P_d(y|x) & \text{if } C(x, y) > 0 \\ \alpha(x) P(y) & \text{otherwise.} \end{cases} \qquad (2.4)$$



In later chapters, we will take advantage of the placeholder $P_r$ to insert our own similarity-based
probability redistribution models. The quantity $\alpha(x)$ is a normalization factor required to ensure
that $\sum_y P_{BO}(y|x) = 1$:

$$\alpha(x) = \frac{\beta(x)}{\sum_{y : C(x,y) = 0} P_r(y|x)} = \frac{\beta(x)}{1 - \sum_{y : C(x,y) > 0} P_r(y|x)},$$

where $\beta(x) = 1 - \sum_{y : C(x,y) > 0} P_d(y|x)$ is the mass left over for unseen pairs involving x.
The second formulation of the normalization is computationally preferable because it is generally
the case that the total number of possible pairs far exceeds the number of observed pairs.
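
The back-off form and its normalization can be sketched as below. This is an illustration under stated assumptions rather than Katz's procedure: the discounted estimate passed in as `p_d` subtracts a fixed 0.5 from each count as a stand-in for Good-Turing discounting, the redistribution model `p_r` is the MLE of the context, and all names and counts are hypothetical. The normalization uses the second formulation, which sums only over seen pairs.

```python
from collections import Counter

def backoff_model(pair_counts, p_d, p_r):
    """Build P_BO(y|x): p_d for seen pairs, alpha(x) * p_r for unseen pairs."""
    seen = {}
    for (x, y) in pair_counts:
        seen.setdefault(x, set()).add(y)

    def alpha(x):
        beta = 1.0 - sum(p_d(y, x) for y in seen[x])           # mass left for unseen pairs
        return beta / (1.0 - sum(p_r(y, x) for y in seen[x]))  # sums over seen pairs only

    def p_bo(y, x):
        if pair_counts.get((x, y), 0) > 0:
            return p_d(y, x)
        return alpha(x) * p_r(y, x)

    return p_bo

# Hypothetical toy counts for (object, context) pairs
counts = Counter({("grill", "steaks"): 3, ("grill", "candidates"): 1,
                  ("post", "updates"): 4})
c_x, c_y = Counter(), Counter()
for (x, y), c in counts.items():
    c_x[x] += c
    c_y[y] += c
total = sum(counts.values())

p_d = lambda y, x: (counts[(x, y)] - 0.5) / c_x[x]  # assumed stand-in discount
p_r = lambda y, x: c_y[y] / total                   # redistribution by context probability
p_bo = backoff_model(counts, p_d, p_r)

unseen_prob = p_bo("updates", "grill")
row_sum = sum(p_bo(y, "grill") for y in sorted(c_y))
print(unseen_prob, row_sum)
```

Swapping in a different `p_r` is all that is needed to change how the leftover mass is spread over unseen pairs.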

Should we use Jelinek-Mercer smoothing, Katz back-off smoothing, or perhaps some other technique?
A thorough study by Chen and Goodman (1996) showed that back-off and Jelinek-Mercer
smoothing perform consistently well, with back-off generally yielding better results for modeling
pairs. Since the back-off formulation also contains a placeholder for us to apply similarity-based 
estimates, we will use Katz's estimation method whenever smoothed distributions are required. 



2.3 Measures of Distributional Similarity 

In this section, we consider theoretical and computational properties of several functions measuring 
the "similarity" between distributions. We refer to these functions as distance functions, rather 
than similarity functions, since most of them achieve their minimum when the two distributions 
being compared are maximally similar (i.e., identical). The work described in chapters 4 and 5 uses
negative exponentials of distance functions when true similarity functions (that is, functions that
increase as similarity increases) are required.

We certainly do not intend to give an exhaustive listing of all distance functions. (See Anderberg
(1973) for an extensive survey.) Our purpose is simply to examine important properties of functions
that we use or that are commonly employed by other researchers in natural language processing and 
machine learning. 



We discuss the KL divergence in section 2.3.1 in detail, as it forms the basis for most of the work
in this thesis. We also describe several other distance functions, including the total divergence to the
mean (section 2.3.2), various geometric norms (section 2.3.3), and some similarity statistics (section
2.3.4). We will pay particular attention to the computational requirements of these functions. In view
of the fact that we wish to use very large data sets, we will require that the time needed to calculate
the distance between any two distributions be linear or near-linear in the number of attributes. This
demand is not strictly necessary for the work described in this thesis: the clustering work of chapter
3 depends on the use of the KL divergence, and the similarity computations of chapters 4 and 5 are
done in a preprocessing phase. However, one of our future goals is to find adaptive versions of our
algorithms, in which case we must use functions that can be computed efficiently.



We defer discussion of the confusion probability, defined by Essen and Steinbiss (1992), until 
chapter ^. This function is of great importance to us because Essen and Steinbiss's co-occurrence 
smoothing method is quite similar to our own work on language modeling. The reason we do not 
include the confusion probability in this chapter is that it is not a function of two distributions:
each object x is described both by the conditional probability $P(y|x)$ and the marginal probability
$P(x)$, so that comparing two objects involves four distributions.

For the remainder of this section, let $x_1$, $x_2$, and $x_3$ be three objects with associated distributions
$P(\cdot|x_1)$, $P(\cdot|x_2)$, and $P(\cdot|x_3)$, respectively. It doesn't matter how these distributions were estimated.
For notational convenience, we will call these distributions q, r, and s. We will occasionally refer to
a distribution p by its corresponding attribute vector $(p(y_1), p(y_2), \ldots, p(y_N))$.

2.3.1 KL Divergence 

We define the function $D(q\|r)$ as

$$D(q\|r) = \sum_{y \in \mathcal{Y}} q(y) \log \frac{q(y)}{r(y)} \qquad (2.5)$$

(we will not specify the base of the logarithm). Limiting arguments lead us to set $0 \log \frac{0}{r} = 0$, even
if $r = 0$, and $q \log \frac{q}{0} = \infty$ when q is not zero.
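
A direct transcription of definition (2.5), including the two limiting conventions, might look like the sketch below. The function name, the default base of 2, and the example distributions are assumptions made for illustration.

```python
import math

def kl_divergence(q, r, base=2.0):
    """D(q||r) for two distributions given as equal-length sequences."""
    total = 0.0
    for qy, ry in zip(q, r):
        if qy == 0:
            continue         # 0 log(0/r) is taken to be 0, even if r = 0
        if ry == 0:
            return math.inf  # q log(q/0) is infinite when q is not zero
        total += qy * math.log(qy / ry, base)
    return total

# Hypothetical distributions over three contexts
q = (0.5, 0.5, 0.0)
r = (0.25, 0.25, 0.5)
print(kl_divergence(q, r), kl_divergence(r, q), kl_divergence(q, q))
```

The two orderings of the arguments give different values here (one finite, one infinite), and the loop makes a single pass over the attributes, so the cost is linear in their number.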

Function (2.5) goes by many names in the literature, including information gain (Renyi, 1970),
error (Kerridge, 1961), relative entropy, cross entropy, and Kullback-Leibler distance (Cover and
Thomas, 1991). Kullback himself refers to the function as information for discrimination, reserving
the term "divergence" for the symmetric function $D(q\|r) + D(r\|q)$ (Kullback, 1959). We will use
the name Kullback-Leibler (KL) divergence throughout this thesis.

The KL divergence is a standard information-theoretic "measure" of the dissimilarity between 
two probability mass functions, and has been applied to natural language processing (as described 
in this thesis), machine learning, and statistical physics. It is not a metric in the technical sense, 
for it is not symmetric and does not obey the triangle inequality (see, e.g., theorem 12.6.1 of Cover
and Thomas (1991)). However, it is non-negative, as shown in the following theorem.



Theorem 2.1 (Information inequality) $D(q\|r) \geq 0$, with equality holding if and only if $q(y) = r(y)$
for all $y \in \mathcal{Y}$.

Proof. Most authors prove this theorem using Jensen's inequality, which deals with expectations
of convex functions (notice that $D(q\|r)$ is the expected value with respect to q of the quantity
$\log(q/r)$). However, we present here a short proof attributed to Elizabeth Thompson (Green, 1996).

Let ln denote the natural logarithm, and let $b > 1$ be the base of the logarithm in (2.5). First
observe that for any $z > 0$, $\ln(z) \leq z - 1$, with equality holding if and only if $z = 1$. Then, we can
write

$$\begin{aligned}
-D(q\|r) &= \frac{1}{\ln(b)} \sum_{y \in \mathcal{Y}} q(y) \ln \frac{r(y)}{q(y)} \\
&\leq \frac{1}{\ln(b)} \sum_{y \in \mathcal{Y}} q(y) \left( \frac{r(y)}{q(y)} - 1 \right) \\
&= \frac{1}{\ln(b)} \left( \sum_{y \in \mathcal{Y}} r(y) - \sum_{y \in \mathcal{Y}} q(y) \right) \\
&= \frac{1}{\ln(b)} (1 - 1) = 0,
\end{aligned}$$

with equality holding if and only if $r(y)/q(y) = 1$ for all $y \in \mathcal{Y}$. ∎

Since the KL divergence is 0 when the two distributions are exactly the same and greater than 0
otherwise, it is really a measure of dissimilarity, as mentioned above, rather than similarity. This
yields an intuitive explanation of why we should not expect the KL divergence to obey the triangle
inequality: as Hatzivassiloglou and McKeown (1993) observe, dissimilarity is not transitive.

What motivates the use of the KL divergence, if it is not a true distance metric? We appeal to 
statistics, information theory, and the maximum entropy principle. 

The statistician Kullback (1959) derives the KL divergence from a Bayesian perspective. Let $Y$
be a random variable taking values in $\mathcal{Y}$. Suppose we are considering exactly two hypotheses
about $Y$: $H_q$ is the hypothesis that $Y$ is distributed according to $q$, and $H_r$ is the hypothesis
that $Y$ is distributed according to $r$. Using Bayes' rule, we can write the posterior probabilities of
the two hypotheses as

$$P(H_q \mid y) = \frac{P(H_q)\, q(y)}{P(H_q)\, q(y) + P(H_r)\, r(y)}$$

and

$$P(H_r \mid y) = \frac{P(H_r)\, r(y)}{P(H_q)\, q(y) + P(H_r)\, r(y)}.$$

Taking logs of both equations and subtracting, we obtain

$$\log \frac{q(y)}{r(y)} = \log \frac{P(H_q \mid y)}{P(H_r \mid y)} - \log \frac{P(H_q)}{P(H_r)}.$$

We can therefore consider $\log(q(y)/r(y))$ to be the information $y$ supplies for choosing $H_q$ over
$H_r$: it is the difference between the logarithms of the posterior odds ratio and the prior odds ratio.
$D(q\|r)$ is then the average information for choosing $H_q$ over $H_r$. Thus, the KL divergence does
indeed measure the dissimilarity between two distributions, since the greater their divergence is, the
easier it is, on average, to distinguish between them.
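The relation between the log odds and $\log(q(y)/r(y))$ can be checked numerically. The following is a minimal sketch; the prior and the two likelihood values are hypothetical numbers chosen only for illustration, and the function name `posterior` is ours.

```python
import math

def posterior(prior_q, q_y, r_y):
    """Posterior probabilities of H_q and H_r after observing y (Bayes' rule)."""
    prior_r = 1.0 - prior_q
    evidence = prior_q * q_y + prior_r * r_y
    return prior_q * q_y / evidence, prior_r * r_y / evidence

# Hypothetical values: prior P(H_q) = 0.3, and an observed y with q(y) = 0.6, r(y) = 0.2.
prior_q, q_y, r_y = 0.3, 0.6, 0.2
post_q, post_r = posterior(prior_q, q_y, r_y)

log_posterior_odds = math.log(post_q / post_r)
log_prior_odds = math.log(prior_q / (1.0 - prior_q))
information = math.log(q_y / r_y)  # information y supplies for H_q over H_r

print(log_posterior_odds - log_prior_odds, information)
```

The two printed numbers agree up to floating-point error, whatever prior is chosen.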



Another statistical rationale for using the KL divergence is given by Cover and Thomas (1991).
Let the empirical frequency distribution of a sample $\mathbf{y}$ of length $n$ be the probability mass
function $P_{\mathbf{y}}$, where $P_{\mathbf{y}}(y)$ is simply the number of times $y$ showed up in the
sample divided by $n$.

Theorem 2.2 Let $r$ be a hypothesized source distribution. The probability according to $r$ of observing
a sample of length $n$ with empirical frequency distribution $q$ is approximately $b^{-nD(q\|r)}$, where
$b$ is the base of the logarithm function.

Therefore, we see that if we are trying to decide between hypotheses $r_1, r_2, \ldots$, when $q$ is the
empirical frequency distribution of the observed sample, then $D(q\|r_i)$ gives the relative weight of
evidence in favor of hypothesis $r_i$.

The KL divergence arises in information theory as a measure of coding inefficiency. If $Y$ is
distributed according to $q$, then the average codeword length of the best code for $Y$ is the entropy
$H(q)$ of $q$:

$$H(q) = -\sum_{y \in \mathcal{Y}} q(y) \log q(y).$$

However, if distribution $r$ were (mistakenly) used to encode $Y$, then the average codeword length of
the resulting code would increase by $D(q\|r)$. Therefore, if the divergence between $q$ and $r$ is large,
then $q$ and $r$ must be dissimilar, since it is inefficient (on average) to use $r$ in place of $q$.
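The coding interpretation can be illustrated with a small sketch. It assumes idealized code lengths of $-\log_2 r(y)$ bits (ignoring the rounding to whole bits that a real code needs), and the two distributions are hypothetical; the function names are ours.

```python
import math

def entropy(q, base=2.0):
    """H(q): average length of the best code for a source distributed as q."""
    return -sum(p * math.log(p, base) for p in q if p > 0)

def expected_code_length(q, r, base=2.0):
    """Average length when symbols drawn from q are coded with ideal lengths -log r(y)."""
    return -sum(p * math.log(s, base) for p, s in zip(q, r) if p > 0)

def kl_divergence(q, r, base=2.0):
    """D(q || r); assumes r(y) > 0 wherever q(y) > 0."""
    return sum(p * math.log(p / s, base) for p, s in zip(q, r) if p > 0)

q = [0.5, 0.25, 0.25]   # true source distribution (hypothetical)
r = [0.25, 0.25, 0.5]   # distribution mistakenly used to build the code

h = entropy(q)
length = expected_code_length(q, r)
d = kl_divergence(q, r)
print(h, length, d)
```

Here the extra cost `length - h` equals `d`, the divergence in bits.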

Finally, we look at the maximum entropy argument. The entropy of a distribution can be
considered a measure of its uncertainty; distributions for which many outcomes are likely (so that
one is "uncertain" which outcome will occur) can only be described by relatively complicated codes.

The maximum entropy principle, first stated by Jaynes (1957), is to assume that the distribution
underlying some observed data is the distribution with the highest entropy among all those consistent
with the data - that is, one should pick the distribution that makes the fewest assumptions necessary.
If one accepts the maximum entropy principle, then one can use it to motivate the use of the KL
divergence in the following manner. The distribution $r(y) = 1/|\mathcal{Y}|$ is certainly the a priori
maximum entropy distribution. We can write

$$
\begin{aligned}
D(q\|r) &= \sum_{y \in \mathcal{Y}} q(y) \log q(y) - \sum_{y \in \mathcal{Y}} q(y) \log r(y) \\
&= \log |\mathcal{Y}| - H(q).
\end{aligned}
$$

Maximizing entropy is therefore equivalent to minimizing the KL divergence to the prior $r$ given
above, subject to the constraint that one must choose a distribution that fits the data.
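A short numerical sketch of this equivalence follows. The list `candidates` is a hypothetical stand-in for "the distributions that fit the data"; among them, the one with the highest entropy is also the one closest in KL divergence to the uniform prior.

```python
import math

def entropy(q):
    """H(q) in nats."""
    return -sum(p * math.log(p) for p in q if p > 0)

def kl_to_uniform(q):
    """D(q || r) in nats, where r is the uniform distribution over len(q) outcomes."""
    u = 1.0 / len(q)
    return sum(p * math.log(p / u) for p in q if p > 0)

# Hypothetical candidate distributions over three outcomes.
candidates = [[0.7, 0.2, 0.1], [0.4, 0.4, 0.2], [1 / 3, 1 / 3, 1 / 3]]

for q in candidates:
    # Each row shows D(q || uniform) next to log|Y| - H(q).
    print(kl_to_uniform(q), math.log(len(q)) - entropy(q))

most_uncertain = max(candidates, key=entropy)
closest_to_prior = min(candidates, key=kl_to_uniform)
```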

To summarize, we have described three motivations for using the KL divergence. For the sake
of broad acceptability, we have given both Bayesian arguments (those that refer to priors) and
non-Bayesian ones.^ These are by no means the only reasons. For further background, see Cover
and Thomas (1991) and Kullback (1959) for general information, Aczél and Daróczy (1975) for an
axiomatic development, and Rényi (1970) for a description of information theory that uses the KL
divergence as a starting point.

Some authors (Brown et al., 1992; Church and Hanks, 1990; Dagan, Marcus, and Markovitch,
1995; Luk, 1995) use the mutual information, which is the KL divergence between the joint
distribution of two random variables and the product of their individual distributions. Let $A$ and $B$
be two random variables with probability mass functions $f(A)$ and $g(B)$, respectively, and let
$h(A, B)$ be their joint distribution function. Then

$$I(A, B) = D(h \| f \cdot g) = \sum_{a \in \mathcal{A}} \sum_{b \in \mathcal{B}} h(a, b) \log \frac{h(a, b)}{f(a)\, g(b)}, \tag{2.6}$$

where $\mathcal{A}$ and $\mathcal{B}$ denote the sets of possible values for $A$ and $B$, respectively. The
mutual information measures the dependence of $A$ and $B$, for if $A$ and $B$ are independent, then
$h = f \cdot g$, which implies that the KL divergence between $h$ and $f \cdot g$ is zero by the information
inequality (theorem 2.1). We will not give the mutual information further consideration because we do
not wish to attempt to estimate joint distributions. Indeed, Church and Hanks (1990) consider two words
to be associated if the words occur near each other in some sample of text; but Hatzivassiloglou and
McKeown (1992) note that the occurrence of two adjectives in the same noun phrase means that the
adjectives cannot be similar. Thus, the information that joint distributions carry about similarity
varies too widely across different applications for it to be a generally useful notion for us.
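As a small illustration of equation (2.6), the sketch below computes the mutual information of a joint distribution stored as a dictionary. The two joint distributions are hypothetical: one is built to be independent, the other perfectly dependent.

```python
import math

def mutual_information(joint):
    """I(A, B) = D(h || f*g) in nats for a joint pmf given as {(a, b): h(a, b)}."""
    f, g = {}, {}
    for (a, b), h in joint.items():   # accumulate the two marginals
        f[a] = f.get(a, 0.0) + h
        g[b] = g.get(b, 0.0) + h
    return sum(h * math.log(h / (f[a] * g[b]))
               for (a, b), h in joint.items() if h > 0)

# Independent case: h(a, b) = f(a) * g(b) by construction.
independent = {(a, b): pa * pb
               for a, pa in (("x", 0.3), ("y", 0.7))
               for b, pb in (("u", 0.4), ("v", 0.6))}
# Fully dependent case: knowing A determines B.
dependent = {("x", "u"): 0.5, ("y", "v"): 0.5}

print(mutual_information(independent), mutual_information(dependent))
```

The independent case gives a value of zero up to rounding, while the dependent case gives $\log 2$ nats.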

While there are many theoretical reasons justifying the use of the KL divergence, there is a
problem with employing it in practice. Recall that for distributions $q$ and $r$, $D(q\|r)$ is infinite if
there is some $y' \in \mathcal{Y}$ such that $r(y') = 0$ but $q(y')$ is nonzero. If we know $q$ and $r$
exactly, then this is sensible, since the value $y'$ allows us to distinguish between $q$ and $r$ with
absolute confidence. However, often it is the case that we only have estimates $\hat{q}$ and $\hat{r}$ for
$q$ and $r$. If we are not careful with our estimates, then we may erroneously set $\hat{r}(y)$ to zero for
some $y$ for which $\hat{q}(y) > 0$, with the effect that $D(\hat{q}\|\hat{r})$ can be infinite when
$D(q\|r)$ is not.
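The failure mode is easy to reproduce. In the sketch below, the distributions and the "estimate" are hypothetical: the estimate of $r$ is a relative-frequency estimate from a small imagined sample that happened never to contain the context `"c"`.

```python
import math

def kl_divergence(q, r):
    """D(q || r) in nats for pmfs given as dicts; infinite if r(y) = 0 < q(y)."""
    total = 0.0
    for y, qy in q.items():
        if qy == 0.0:
            continue                  # 0 log 0 is taken to be 0
        ry = r.get(y, 0.0)
        if ry == 0.0:
            return math.inf           # y distinguishes q from r with certainty
        total += qy * math.log(qy / ry)
    return total

q_true = {"a": 0.5, "b": 0.4, "c": 0.1}
r_true = {"a": 0.6, "b": 0.39, "c": 0.01}
r_hat = {"a": 0.6, "b": 0.4}          # unsmoothed estimate: "c" was never observed

print(kl_divergence(q_true, r_true))  # finite
print(kl_divergence(q_true, r_hat))   # inf
```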

There are several ways around this problem. One is to use smoothed estimates, as described
above in section 2.2, for $q$ and $r$; this is the approach taken in chapter ^. Another is to only
calculate the KL divergence between distributions and average distributions. The work described in
chapter 3 computes divergences to cluster centroids, which are created by averaging a whole class
of objects. Chapter ^ describes experiments where we calculate the total divergence of $q$ and $r$ to
their average; we examine some properties of the total divergence in the next subsection.



2.3.2 Total Divergence to the Mean 



Equation (2.7) gives the definition of the total (KL) divergence to the mean, which appears in Dagan,
Lee, and Pereira (1997) ($A$ stands for "average"):

$$A(q, r) = D\left(q \,\Big\|\, \frac{q + r}{2}\right) + D\left(r \,\Big\|\, \frac{q + r}{2}\right), \tag{2.7}$$

where $((q + r)/2)(y) = (q(y) + r(y))/2$. If $q$ and $r$ are two empirical frequency distributions (defined
just above theorem 2.2), then $A(q, r)$ can be used as a test statistic for the hypothesis that $q$ and $r$
are drawn from the same distribution.

Using theorem 2.1, we see that $A(q, r) \ge 0$, with equality if and only if $q = r$. $A(q, r)$ is clearly
a symmetric function, but does not obey the triangle inequality, as will be shown below.



^ The often heated debates between Bayesians and non-Bayesians are well known. For example, Skilling
(1991, pg. 24) writes, "there is a valid defence [sic] of using non-Bayesian methods, namely incompetence."

We can write $A(q, r)$ in a more convenient form by observing that

$$
D\left(q \,\Big\|\, \frac{q + r}{2}\right)
= \sum_{y \in \mathcal{Y}} q(y) \log \frac{2\, q(y)}{q(y) + r(y)}
= \log 2 + \sum_{y \in \mathcal{Y}} q(y) \log \frac{q(y)}{q(y) + r(y)}.
$$

The sum over $y \in \mathcal{Y}$ may be broken up into two parts, a sum over those $y$ such that both $q(y)$
and $r(y)$ are greater than zero, and a sum over those $y$ such that $q(y)$ is greater than zero but
$r(y) = 0$. We call these sets Both and Justq, respectively: $\mathit{Both} = \{y : q(y) > 0, r(y) > 0\}$ and
$\mathit{Justq} = \{y : q(y) > 0, r(y) = 0\}$. Then,

$$
\begin{aligned}
D\left(q \,\Big\|\, \frac{q + r}{2}\right)
&= \log 2 + \sum_{y \in \mathit{Both}} q(y) \log \frac{q(y)}{q(y) + r(y)}
 + \sum_{y \in \mathit{Justq}} q(y) \log \frac{q(y)}{q(y) + r(y)} \\
&= \log 2 + \sum_{y \in \mathit{Both}} q(y) \log \frac{q(y)}{q(y) + r(y)},
\end{aligned}
$$

since each term of the sum over Justq is $q(y) \log 1 = 0$.

A similar decomposition of $D(r \| \frac{q + r}{2})$ into two sums over Both and
$\mathit{Justr} = \{y : r(y) > 0, q(y) = 0\}$ holds. Therefore, we can write

$$A(q, r) = 2 \log 2 + \sum_{y \in \mathit{Both}} \left( q(y) \log \frac{q(y)}{q(y) + r(y)} + r(y) \log \frac{r(y)}{q(y) + r(y)} \right). \tag{2.8}$$



Equation (2.8) is computationally convenient, for it involves sums only over elements of Both, as
opposed to over all the elements in $\mathcal{Y}$. We will typically consider situations in which Both is
(estimated to be) much smaller than $\mathcal{Y}$.

Since the two ratios in (2.8) are both less than one, the sum over elements in Both is always
negative when Both is nonempty. $A(q, r)$ therefore reaches its maximum when the set Both is empty, in
which case $A(q, r) = 2 \log 2$. This observation makes it easy to see that $A(q, r)$ does not obey the
triangle inequality. Let $\mathcal{Y} = \{y_1, y_2\}$. Consider distributions $q$, $r$, and $s$, where

$$q(y_1) = 1,\ q(y_2) = 0; \qquad r(y_1) = r(y_2) = \tfrac{1}{2}; \qquad s(y_1) = 0,\ s(y_2) = 1.$$

Then $A(q, r) + A(r, s) = \log 2 + \log(2/3) + 2 \log(4/3) = \log 2 + \log(32/27) < 2 \log 2$, whereas
$A(q, s) = 2 \log 2$, since the supports for $q$ and $s$ are disjoint. Therefore,
$A(q, r) + A(r, s) < A(q, s)$, violating the triangle inequality.
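The computation in (2.8) and the counterexample can be reproduced with a short sketch. It assumes that each distribution is stored as a dictionary holding only its nonzero probabilities, so that Both is simply the intersection of the key sets; the function name is ours.

```python
import math

def total_divergence(q, r):
    """A(q, r) in nats via equation (2.8): only contexts in Both enter the sum."""
    both = q.keys() & r.keys()
    s = 0.0
    for y in both:
        qy, ry = q[y], r[y]
        s += qy * math.log(qy / (qy + ry)) + ry * math.log(ry / (qy + ry))
    return 2.0 * math.log(2.0) + s

# The three distributions from the counterexample, stored sparsely.
q = {"y1": 1.0}
r = {"y1": 0.5, "y2": 0.5}
s = {"y2": 1.0}

lhs = total_divergence(q, r) + total_divergence(r, s)
rhs = total_divergence(q, s)
print(lhs, rhs)  # lhs is smaller than rhs, so the triangle inequality fails
```

Storing only nonzero entries is what makes the restriction to Both pay off: the loop touches only the contexts the two distributions share.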



2.3.3 Geometric Distances 



If we think of probability mass functions as vectors, so that distribution $p$ is associated with the
vector $(p(y_1), p(y_2), \ldots, p(y_N))$ in $\mathbb{R}^N$, then we can measure the distance between
distributions by various geometrically-motivated functions, including the $L_1$ and $L_2$ norms and the
cosine function. All three of these functions appear quite commonly in the clustering literature
(Kaufman and Rousseeuw, 1990; Cutting et al., 1992; Schütze, 1993). The first two functions are true
metrics, as the name norm suggests.

The $L_1$ norm (also called the "Manhattan" or "taxi-cab" distance) is defined as

$$L_1(q, r) = \sum_{y \in \mathcal{Y}} |q(y) - r(y)|. \tag{2.9}$$

Clearly, $L_1(q, r) = 0$ if and only if $q(y) = r(y)$ for all $y$. Interestingly, $L_1(q, r)$ bears the
following relation, discovered independently by Csiszár and Kemperman, to $D(q\|r)$:

$$L_1(q, r) \le \sqrt{D(q\|r) \cdot 2 \ln b}, \tag{2.10}$$

where $b$ is the base of the logarithm function. Consequently, convergence in KL divergence implies
convergence in the $L_1$ norm. However, we can find a much tighter bound, as follows. By dividing
up the sum in equation (2.9) into sums over Both, Justq, and Justr as defined in section 2.3.2, we
obtain

$$L_1(q, r) = \sum_{y \in \mathit{Justq}} q(y) + \sum_{y \in \mathit{Justr}} r(y) + \sum_{y \in \mathit{Both}} |q(y) - r(y)|.$$

Since

$$\sum_{y \in \mathit{Justq}} q(y) = 1 - \sum_{y \in \mathit{Both}} q(y) \quad \text{and} \quad \sum_{y \in \mathit{Justr}} r(y) = 1 - \sum_{y \in \mathit{Both}} r(y),$$

we can express $L_1(q, r)$ in a form depending only on the elements of Both:

$$L_1(q, r) = 2 + \sum_{y \in \mathit{Both}} \left( |q(y) - r(y)| - q(y) - r(y) \right). \tag{2.11}$$

Applying the triangle inequality to (2.11), we see that $L_1(q, r) \le 2$, with equality if and only if the
set Both is empty. Also, (2.11) is a convenient expression from a computational point of view, since
we do not need to sum over all the elements of $\mathcal{Y}$. We describe experiments using $L_1$ as distance
function in chapter ^.
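A minimal sketch of the two ways of evaluating the $L_1$ norm follows, again assuming dictionaries that hold only nonzero probabilities. The function names and the example distributions are ours.

```python
def l1_distance(q, r):
    """L1(q, r) via equation (2.11), summing only over contexts in Both."""
    both = q.keys() & r.keys()
    return 2.0 + sum(abs(q[y] - r[y]) - q[y] - r[y] for y in both)

def l1_distance_full(q, r):
    """Direct evaluation of equation (2.9) over every context, for comparison."""
    return sum(abs(q.get(y, 0.0) - r.get(y, 0.0)) for y in q.keys() | r.keys())

q = {"a": 0.5, "b": 0.5}
r = {"b": 0.25, "c": 0.75}
disjoint = l1_distance({"a": 1.0}, {"b": 1.0})   # Both is empty, so the bound of 2 is attained

print(l1_distance(q, r), l1_distance_full(q, r), disjoint)
```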

The $L_2$ norm is the Euclidean distance between vectors. Let $\|\cdot\|$ denote the usual norm function,
$\|q\| = \sqrt{\sum_{y \in \mathcal{Y}} q(y)^2}$. Then,

$$L_2(q, r) = \|q - r\| = \sqrt{\sum_{y \in \mathcal{Y}} (q(y) - r(y))^2}.$$

Since the $L_1$ norm bounds the $L_2$ norm, the inequality of equation (2.10) also applies to the $L_2$
norm.

Although the $L_2$ norm appears quite often in the literature, Kaufman and Rousseeuw (1990)
write that

In many branches of univariate and multivariate statistics it has been known for a long
time that methods based on the minimization of sums (or averages) of dissimilarities
or absolute residuals (the so-called $L_1$ methods) are much more robust than methods
based on sums of squares (which are called $L_2$ methods). The computational simplicity
of many of the latter methods does not make up for the fact that they are extremely
sensitive to the effect of one or more outliers. (pg. 117)

We therefore will not give further consideration to the L2 norm in this thesis. 

Finally, we turn to the cosine function. This symmetric function is related to the angle between
two vectors; the "closer" two vectors are, the smaller the angle between them.

$$\cos(q, r) = \frac{\sum_{y \in \mathcal{Y}} q(y)\, r(y)}{\|q\|\, \|r\|}. \tag{2.12}$$

Notice that the cosine is an inverse distance function, in that it achieves its maximum of 1 when
$q(y) = r(y)$ for all $y$, and is zero when the supports of $q$ and $r$ are disjoint. For all the other
functions described above, it is just the opposite: they are zero if and only if $q(y) = r(y)$ for all $y$,
and are greater than zero otherwise. Further analysis of geometric properties of the cosine function and
other geometric similarity functions used in information retrieval can be found in Jones and Furnas
(1987).



The cosine function is not as efficient to compute as the other functions we have discussed. While
the numerator in (2.12) requires only summing over elements of Both, the elements of Justq and Justr
must be taken into account in calculating the denominator. It may be desirable to calculate the
norms of all distributions as a preprocessing step (we cannot just normalize the vectors because we
would violate the constraint that attribute vector components sum to one).
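The preprocessing idea can be sketched as follows: norms are computed once per distribution, and each pairwise cosine then only needs the dot product over Both. The dictionary layout and the names `norm`, `cosine`, and `distributions` are assumptions made for this illustration.

```python
import math

def norm(p):
    """Euclidean norm of a pmf stored as {context: probability}."""
    return math.sqrt(sum(v * v for v in p.values()))

def cosine(q, r, norm_q=None, norm_r=None):
    """cos(q, r) from equation (2.12); norms may be passed in if precomputed."""
    norm_q = norm(q) if norm_q is None else norm_q
    norm_r = norm(r) if norm_r is None else norm_r
    dot = sum(q[y] * r[y] for y in q.keys() & r.keys())  # only Both contributes
    return dot / (norm_q * norm_r)

distributions = {
    "q": {"a": 0.5, "b": 0.5},
    "r": {"b": 0.25, "c": 0.75},
    "s": {"d": 1.0},
}
# Preprocessing step: one pass over every distribution, touching Justq and Justr too.
norms = {name: norm(p) for name, p in distributions.items()}

cos_qr = cosine(distributions["q"], distributions["r"], norms["q"], norms["r"])
cos_qs = cosine(distributions["q"], distributions["s"], norms["q"], norms["s"])
print(cos_qr, cos_qs)
```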

2.3.4 Similarity Statistics 



There are many correlation statistics for measuring the association between random variables
(Anderberg, 1973, Chapter 4.2). The most well-known of these is the Pearson correlation coefficient;
some non-parametric measures are the gamma statistic, Spearman's correlation coefficient, and
Kendall's $\tau$ coefficient (Gibbons, 1993). The Spearman statistic was used by Finch and Chater (1992)
to find syntactic categories, and Kendall's statistic appears in work by Hatzivassiloglou and McKeown
(1993) (henceforth H&M) on clustering adjectives. We concentrate on the latter statistic since we
will discuss H&M's work in some detail in the next chapter.

Kendall's $\tau$ coefficient is based on pairwise comparisons. For every pair of contexts $(y_i, y_j)$,
we consider the quantities $a^{q}_{ij} = q(y_i) - q(y_j)$ and $a^{r}_{ij} = r(y_i) - r(y_j)$. The pair is a
concordance if both $a^{q}_{ij}$ and $a^{r}_{ij}$ have the same sign, and a discordance if their signs differ
(if either of these quantities is zero, then the pair is a tie, which is neither a concordance nor a
discordance). $\tau(q, r)$ is the difference between the probability of observing a concordance and the
probability of observing a discordance, and so ranges between $-1$ and 1. A value of 1 corresponds to
perfect concordance (but not necessarily equality) between $q$ and $r$, $-1$ corresponds to perfect
discordance, and 0 to no correlation. An unbiased estimator of $\tau(q, r)$ is

$$\frac{\text{number of observed concordances} - \text{number of observed discordances}}{\binom{|\mathcal{Y}|}{2}}.$$

In terms of computational efficiency, $\tau(q, r)$ is slightly more expensive than the total divergence to
the mean or the $L_1$ norm. In order to calculate the number of discordances, H&M first order the $y$'s
in $\mathcal{Y}$ by their probabilities as assigned by $q$. Then, they rerank the $y$'s according to the
probabilities assigned by $r$. The number of discordances is then exactly the number of discrepancies
between the two orderings. Since we need to sort the set $\mathcal{Y}$ and calculate the number of
discrepancies between the two orderings, we spend $O(|\mathcal{Y}| \log_2 |\mathcal{Y}|)$ time to calculate
the similarity between $q$ and $r$. An optimization not noted by H&M is that for all
$y' \in \mathit{Both} \cup \mathit{Justq} \cup \mathit{Justr}$ and
$y'' \notin \mathit{Both} \cup \mathit{Justq} \cup \mathit{Justr}$ (that is, $q(y'') = r(y'') = 0$), the pair
$(y', y'')$ cannot be a discordance - it is a concordance if $y' \in \mathit{Both}$ and a tie otherwise.
Therefore, we actually only need to sort $\mathcal{Y}' = \mathit{Both} \cup \mathit{Justq} \cup \mathit{Justr}$,
an $O(|\mathcal{Y}'| \log_2 |\mathcal{Y}'|)$ operation. In the case of sparse data, this would be a
significant time savings, although we would still be using more than linear time.
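The effect of restricting attention to the contexts with nonzero probability can be checked with a small sketch. Note that it enumerates pairs directly (quadratic time) rather than using the sort-based procedure described above; it is meant only to show that the restricted count plus the correction for zero-probability contexts gives the same estimate. It assumes dictionaries holding only strictly positive probabilities, and all names and example values are ours.

```python
from itertools import combinations
from math import comb

def sign(v):
    return (v > 0) - (v < 0)

def kendall_tau_full(q, r, contexts):
    """Pairwise estimate over every context in `contexts`."""
    score = 0
    for yi, yj in combinations(contexts, 2):
        a_q = sign(q.get(yi, 0.0) - q.get(yj, 0.0))
        a_r = sign(r.get(yi, 0.0) - r.get(yj, 0.0))
        score += a_q * a_r  # +1 concordance, -1 discordance, 0 tie
    return score / comb(len(contexts), 2)

def kendall_tau_sparse(q, r, num_contexts):
    """Same estimate, enumerating only pairs inside Both, Justq and Justr."""
    support = sorted(q.keys() | r.keys())
    both = q.keys() & r.keys()
    score = 0
    for yi, yj in combinations(support, 2):
        score += (sign(q.get(yi, 0.0) - q.get(yj, 0.0))
                  * sign(r.get(yi, 0.0) - r.get(yj, 0.0)))
    # A context in Both paired with a context that has zero probability under
    # both q and r is a concordance; every other such pair is a tie.
    score += len(both) * (num_contexts - len(support))
    return score / comb(num_contexts, 2)

contexts = [f"y{i}" for i in range(1, 9)]      # eight contexts in total
q = {"y1": 0.5, "y2": 0.3, "y3": 0.2}
r = {"y2": 0.6, "y3": 0.1, "y4": 0.3}

tau_full = kendall_tau_full(q, r, contexts)
tau_sparse = kendall_tau_sparse(q, r, len(contexts))
print(tau_full, tau_sparse)
```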

2.3.5 An Example 

To aid in visualizing the behavior of the salient functions described above, we consider a
two-dimensional example where $\mathcal{Y} = \{y_1, y_2\}$. In this situation, $q(y_2) = 1 - q(y_1)$ for any
distribution $q$, so we only need to know the value of a distribution at $y_1$. In figure 2.1 we have
plotted the values of various distance functions with respect to a fixed distribution $r = (.5, .5)$. The
horizontal axis represents the probability of $y_1$, so that .75 on the horizontal axis means the
distribution $q = (.75, .25)$. The fixed distribution $r$ is at .5 on the horizontal axis.

[Plot: "Distances to (or from) distribution r, r(y1) = r(y2) = .5". Curves: D(q||r), D(r||q), A(q,r),
L1(q,r), cos(q,r). Horizontal axis: q(y1).]

Figure 2.1: Comparison of distance functions



As observed above, the KL divergences, the total divergence to the mean, and the $L_1$ norm are
all zero at $r$ and increase as one travels away from $r$. The cosine function, on the other hand, is 1
at $r$ and decreases as one travels away from $r$.

Figure 2.1 demonstrates that the KL divergence is not symmetric, for the curve $D(r\|q)$ lies
above the curve $D(q\|r)$. In general, the KL divergence from a sharp to a flat distribution is less
than the divergence from a flat to a sharp distribution - a sharp distribution (such as (.9, .1)) is
one with relatively high values for some of the attributes, whereas a flat distribution resembles the
uniform distribution. The intuition behind this behavior is as follows. If we assume that the source
distribution (the second argument to $D(\cdot\|\cdot)$) is flat, then it would be somewhat odd to observe a
sharp sample distribution. However, it would be even more surprising to observe a flat sample if
we believe that the source distribution is sharp. For instance, suppose the source distribution were
(.5, .5). Then, the probability of observing 9 $y_1$'s and 1 $y_2$ in a sample of length 10 (i.e., a sharp
empirical distribution) would be

$$\binom{10}{9} (.5)^9 (.5)^1 \approx .01.$$

However, if the source distribution were (.9, .1), then the probability of observing 5 $y_1$'s and 5 $y_2$'s
(i.e., a flat empirical distribution) would be

$$\binom{10}{5} (.9)^5 (.1)^5 \approx .001.$$
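The two sample probabilities, and the asymmetry of the KL divergence that they are meant to explain, can be recomputed with a few lines. This is only a check of the arithmetic in the example; the helper names are ours.

```python
from math import comb, log

def binomial_pmf(k, n, p):
    """Probability of exactly k occurrences of y1 in n draws when P(y1) = p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def kl(q, r):
    """D(q || r) in nats; assumes r(y) > 0 wherever q(y) > 0."""
    return sum(a * log(a / b) for a, b in zip(q, r) if a > 0)

sharp, flat = (0.9, 0.1), (0.5, 0.5)

p_sharp_sample = binomial_pmf(9, 10, 0.5)   # 9 y1's and 1 y2 from the flat source
p_flat_sample = binomial_pmf(5, 10, 0.9)    # 5 y1's and 5 y2's from the sharp source

print(p_sharp_sample, p_flat_sample)
print(kl(sharp, flat), kl(flat, sharp))     # sharp-to-flat is the smaller divergence
```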



An interesting feature to note is that the curve for $A(q, r)$, the total divergence to the mean, is
lower than the KL divergence curves, and that these, in turn, are for the most part lower than the
$L_1$ curve. We speculate that the flatness of $D(q\|r)$ and $A(q, r)$ relative to $L_1(q, r)$ around the
point $q = r$ indicates that these two functions are somewhat more robust to sampling error, for using
$q = r + \epsilon$ (for small $\epsilon$) instead of $q = r$ results in a much greater change in the value of
the $L_1$ norm than in the value of the KL divergence or the total divergence to the mean.

2.4 Summary and Preview



We have now established the groundwork for the results of this thesis. We have explained why we 
want to use distributions to represent objects, and have described ways to estimate these distributions 
and to measure the similarity between distributions. 

We have been working with conditional probabilities induced by objects over contexts. As
mentioned above, "objects" and "contexts" are fairly general notions; for instance, an object might be a
document and the contexts might be the set of words that can occur in a document. We will confine
our attention to modeling pairs of words, so that $\mathcal{X}$ and $\mathcal{Y}$ are sets of words. In
chapters ^ and ^, $\mathcal{X}$ is a set of nouns and $\mathcal{Y}$ is a set of transitive verbs; $C(x, y)$
indicates the number of times $x$ was the direct object of verb $y$. Chapter ^ considers the bigram case,
where $\mathcal{X}$ is the set of all possible words, $\mathcal{Y} = \mathcal{X}$, and $C(x, y)$ denotes the
number of times word $x$ occurred immediately before the word $y$.

Chapter 3

Distributional Clustering 



This chapter describes the first of our similarity-based methods for estimating probabilities. The
probabilistic, hierarchical distributional clustering scheme detailed here is a model-based approach,
where the behavior of objects is modeled by class behavior. The following two chapters describe a
nearest-neighbor approach, where we base our estimate of an object's behavior on the behavior of
objects most similar to it, so that no class construction is involved.



3.1 Introduction 

Much attention has been devoted to the study of clustering techniques, and indeed whole books have
been written on the subject (Anderberg, 1973; Hartigan, 1975; Kaufman and Rousseeuw, 1990).
Traditional applications of clustering include discovering structure in data and providing summaries
of data. We propose to use clustering as a solution to sparse data problems: by grouping data into
similarity classes, we create new, generalized sources of information which may be consulted when
information about more specific events is lacking. That is, if we wish to estimate the probability of
an event $E$ that occurs very rarely in some sample, then we can base our estimate on the average
behavior of the events in $E$'s class(es); since a class encompasses several data points, estimates of
class probability are based on more data than estimates of the probability of a single event. For
example, suppose we wish to estimate the graduation rate of Asian-American females enrolled at
Westlake High School in Westlake, Ohio. If there is only one Asian-American female at WHS, then
we will not have enough data to infer the right rate (we would probably have to guess either 100%
or 0%). Suppose, however, that we consider a group of high schools that are similar to WHS (e.g.,
public high schools in suburban areas in Ohio). Then, we can average together information about
Asian-American females attending schools in that group to make a better estimate.

To our knowledge, all clustering algorithms in the natural language processing literature create
"hard" or Boolean classes, with every data point belonging to one and only one class. In other
words, these algorithms build partitions of the data space. The combinatorial demands of such hard
clustering schemes are enormous, as there are $\genfrac{\{}{\}}{0pt}{}{n}{k}$ ways to group $n$
observations into $k$ non-empty sets, where

$$\genfrac{\{}{\}}{0pt}{}{n}{k} = \frac{1}{k!} \sum_{i=0}^{k} (-1)^{k-i} \binom{k}{i}\, i^{n}$$

is a Stirling number of the second kind (Knuth, 1973). There are a huge number of possible groupings
even for small values of $k$ and $n$: Hatzivassiloglou and McKeown (1993) observe that one can divide
twenty-one points into nine sets in approximately $1.23 \times 10^{14}$ ways. As it turns out, the problem
of finding a partition that minimizes some optimization function is NP-complete (Brucker, 1978), so,
not surprisingly, most hard clustering algorithms resort to greedy or hill-climbing search to find a
good partition.
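To give a feel for how quickly the number of partitions grows, the following sketch evaluates the Stirling number of the second kind with exact integer arithmetic, using the standard inclusion-exclusion sum. The function name `stirling2` is ours.

```python
from math import comb, factorial

def stirling2(n, k):
    """Number of ways to partition n items into k non-empty sets (exact integer)."""
    total = sum((-1) ** (k - i) * comb(k, i) * i ** n for i in range(k + 1))
    return total // factorial(k)

print(stirling2(4, 2))    # 7 ways to split four points into two non-empty groups
print(stirling2(21, 9))   # the twenty-one points, nine sets case
```

The second value is on the order of $10^{14}$, consistent with the approximate count reported by Hatzivassiloglou and McKeown, which is why exhaustive search over partitions is out of the question even for tiny data sets.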

Greedy and hill-climbing approaches all first create an initial clustering and then iteratively
make local changes to the clustering in order to improve the value of some optimization function.
Let $k$ be the desired number of clusters. Update methods begin with $k$ initial classes chosen in some
fashion, and repeatedly move data points from one class to another. The number of clusters therefore
stays (about) the same from one iteration to the next. Two special cases of update methods are
medoid and centroid methods, both of which represent clusters by data points. Medoids are actual
data points, whereas centroids are "imaginary" data points created by averaging together object
distributions (Kaufman and Rousseeuw, 1990). Cluster membership is decided by assigning each
object to the closest cluster representative, where "closeness" is measured by some distance function.
Each iteration step consists of first moving some representative in order to improve the value of the
optimization criterion, and then updating cluster memberships.

Non-update methods, where the number of clusters varies during the course of the clustering, 
include divisive and agglomerative clustering. Divisive algorithms start with one universal class to 
which all the data points belong; each iteration involves choosing one of the current set of classes 
to split into two new classes. Agglomerative algorithms, in contrast, begin with each data point 
belonging to its own class; then, in each iteration step, some pair of current classes is merged to 
form a new, larger class. In either case, the choice of which class to split or which classes to merge 
is generally made by picking the class or classes whose division or combination results in the largest 
improvement in the optimization function, and the process stops once k clusters have been formed. 

Both divisive algorithms and agglomerative algorithms, if allowed to run to completion (until every
class has been split down to single data points, or until all classes have been merged into one),
readily yield hierarchical clusterings, which can be represented by dendrograms (essentially, binary
trees). At the root of the dendrogram is the class containing all the data points (the first class
considered in the divisive case and the last class formed in the agglomerative case). Each node $\eta$
in the dendrogram represents a class, denoted by class($\eta$). Nodes $\eta_1$ and $\eta_2$ are children
of node $\eta'$ if at some iteration step either class($\eta'$) was divided into class($\eta_1$) and
class($\eta_2$), or class($\eta_1$) and class($\eta_2$) were agglomerated into class($\eta'$), depending on
which type of clustering algorithm was used.

While the class hierarchy produced may of course itself be of interest, an appealing aspect of
hierarchical clustering is that it provides an attractive solution to the problem of deciding on the
right number of clusters. The partitioning methods mentioned above generally take the number
of clusters $k$ as an input parameter rather than deciding what the right number of clusters is. As
Anderberg (1973) writes, "Hierarchical clustering methods give a configuration for every number
of clusters from one (the entire data set) up to the number of entities (each cluster has only one
member)" (pg. 15). However, both Anderberg (1973) and Kaufman and Rousseeuw (1990) express
reservations about hierarchical methods:

A hierarchical method suffers from the defect that it can never repair what was done
in previous steps. Indeed, once an agglomerative algorithm has joined two objects,
they cannot be separated.... Also, whatever a divisive algorithm has split up cannot be
reunited. The rigidity of hierarchical methods is both the key to their success (because it
leads to small computation times) and their main disadvantage (the inability to correct
erroneous decisions). (Kaufman and Rousseeuw, 1990, pp. 44-45)



We propose a novel "soft" (probabilistic) hierarchical clustering method that overcomes this
rigidity problem. Instead of each data point belonging to one and only one class, we assign probabilities
of class membership, with every data point belonging to every class with positive probability. Since
we reestimate membership probabilities at each iteration, there is no sense in which data points can
be permanently assigned to the same or separate classes.

Probabilistic clusterings have another advantage: they provide a more descriptive summary of
the data. Consider the situation depicted in figure 3.1, where circle B is halfway between A and C.
Suppose that two clusters are desired. A hard clustering is forced to associate B with only one of
the other circles, say, A. It then reports that the partition found is {{A, B}, {C}}, which does not
convey the information that B could just as well have been grouped with C. A soft clustering, on
the other hand, can state that B belongs to A's cluster and C's cluster with equal probability, and
so can express the ambiguity of the situation.

Figure 3.1: An ambiguous case


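The contrast between hard and soft assignment in the ambiguous case can be sketched in a few lines. This is only an illustration under assumptions of our own: the three circles are placed on a line, the two cluster representatives sit at A and C, and soft memberships are taken proportional to `exp(-beta * distance)`, which is one common choice and is not claimed here to be the membership rule developed later in the chapter.

```python
import math

def hard_assign(x, centroids):
    """Index of the nearest centroid; ties are broken in favor of the first one."""
    return min(range(len(centroids)), key=lambda i: abs(x - centroids[i]))

def soft_assign(x, centroids, beta=1.0):
    """Membership probabilities proportional to exp(-beta * distance) (an assumed rule)."""
    weights = [math.exp(-beta * abs(x - c)) for c in centroids]
    z = sum(weights)
    return [w / z for w in weights]

# Hypothetical positions: B lies exactly halfway between A and C.
A, B, C = 0.0, 1.0, 2.0
centroids = [A, C]

print(hard_assign(B, centroids))   # B is forced into A's cluster by the tie-break
print(soft_assign(B, centroids))   # equal membership in both clusters
print(soft_assign(A, centroids))   # A belongs mostly, but not exclusively, to its own cluster
```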

In brief, our clustering method is a centroid-based, probabilistic, divisive, hierarchical algorithm
for associating abstract objects by learning their distributions. Each class is represented by a
centroid, which is placed at the cluster's (weighted) center of mass. For each object $x$ and each centroid
$c$, we calculate a membership probability $P(c|x)$ that $x$ belongs to $c$. Our method begins by creating
a single centroid, with each object belonging to that centroid with probability one, and then
iteratively splits one of the current centroids and reestimates membership probabilities. The creation of
child centroids from parent centroids creates a hierarchy of classes in the obvious way.

As decided upon in section 2.1, objects (both data points and centroids) will be represented by
distributions over a set $\mathcal{Y}$ of contexts. We will use the KL divergence, discussed at length in
section 2.3.1, as distance function. Our optimization function is the free energy, a quantity motivated
by statistical physics; the algorithm uses deterministic annealing to find phase transitions of the
free energy, and splits cluster centroids at these transitions. Each time we update the annealing
parameter, we reestimate the location of the cluster centroids and the membership probabilities for
each object.

We shall be especially interested in the problem of clustering words, although our theoretical 
results will be described in a general fashion. We re-emphasize that our clustering method can be 
used for clustering any objects that can be described as distributions, and indeed future work involves 
employing our techniques for clustering documents. We evaluate our method on tasks involving the 
prediction of object-verb pairs, and find that it greatly reduces error rate, especially in cases where 
traditional methods such as Katz's back-off method (see section 2.2) would fail. 



3.2 Word Clustering 

Methods for automatically classifying words according to their contexts are of both scientific and 
practical interest. The scientific questions arise in connection with distributional views of linguistic 
(particularly lexical) structure and also in relation to the question of lexical acquisition. From a 
practical point of view, word classification addresses issues of data sparsity and generalization in 
statistical language models, especially models used to decide among alternative analyses proposed 
by a grammar. 

It is well known that a simple tabulation of frequencies of certain words participating in certain 
configurations (for example, frequencies of pairs of transitive main verbs and head nouns of the 
verbs' direct objects), cannot be reliably used for comparing the likelihoods of different alternative 
configurations. The problem is that for large samples, the number of possible joint events is much 
larger than the number of event occurrences in the sample, so many events occur rarely or even not 
at all. Frequency counts thus yield unreliable estimates of their probabilities. 

Hindle (1990) proposed dealing with the data sparseness problem by estimating the likelihood
of unseen events from that of "similar" events that have been seen. For instance, one may estimate
the likelihood of a particular adjective modifying a noun from the likelihoods of that adjective
modifying similar nouns. This requires a reasonable definition of noun similarity and a method for
incorporating the similarity into a probability estimate. In Hindle's proposal, words are similar if
there is strong statistical evidence that they tend to participate in the same events. His notion of
similarity seems to agree with our intuitions in many cases, but it is not clear how to use this notion
to construct word classes and corresponding models of association.

In this chapter, we build a similarity-based probability model out of two parts: a model of
the association between words and certain hidden classes, and a model of the behavior of these
classes. Some researchers have built such models from preexisting sense classes constructed by
humans; for example, Resnik (1992) uses WordNet, and Yarowsky (1992b) works with Roget's
thesaurus. As mentioned earlier, however, we are interested in ways to derive classes directly
from distributional data. Resnik's thesis contains a discussion of the relative advantages of the two
approaches (Resnik, 1993).



In what follows, we will consider two sets of words, the set $\mathcal{X}$ of nouns, and the set
$\mathcal{Y}$ of transitive verbs. We are interested in the object-verb relation: the pair $(x, y)$ denotes
the event that noun $x$ occurred as the head noun of the direct object of verb $y$. Our raw knowledge
about the relation consists of the frequencies $C(x, y)$ of particular pairs $(x, y)$ in the required
configuration in a training corpus. Some form of text analysis is required to collect these pairs. The
counts used in our first experiment were derived from newswire text automatically parsed by Hindle's
parser Fidditch (Hindle, 1994). Later, we constructed similar frequency tables with the help of a
statistical part-of-speech tagger (Church, 1988) and tools for regular expression pattern-matching on
tagged corpora (Yarowsky, 1992a). We have not compared the accuracy and coverage of the two methods
or studied what biases they introduce, although we took care to filter out certain systematic errors
(for instance, subjects of complement clauses for report verbs like "say" were incorrectly parsed as
direct objects).

We only consider the problem of classifying nouns according to their distribution as direct objects
of verbs; the converse problem is formally similar. For the noun classification problem, the empirical
distribution of a noun $x$ is given by the conditional density

$$P_{MLE}(y|x) = \frac{C(x,y)}{\sum_{y'} C(x,y')} = \frac{C(x,y)}{C(x)},$$

where $C(z)$ denotes the number of times event $z$ occurred in the training corpus. The problem we
study is how to use the $P_{MLE}(\cdot|x)$ to classify the $x \in \mathcal{X}$. Our classification
method will construct a set $\mathcal{C}$ of clusters $c$ and cluster membership probabilities
$\tilde{P}(c|x)$. Each cluster $c$ is associated with a cluster centroid distribution $\tilde{P}(y|c)$,
which is a discrete density over $\mathcal{Y}$ obtained by computing a weighted average of the noun
distributions $P_{MLE}(\cdot|x)$. We will move freely between describing a noun (or centroid) as $x$
(or $c$) and as $P_{MLE}(\cdot|x)$ (or $\tilde{P}(\cdot|c)$).
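As a small illustration of the empirical distributions just defined, the following sketch turns a table of pair counts into conditional distributions. The function name `mle_conditionals`, the dictionary layout, and the toy counts are assumptions made for this example; they are not taken from the implementation used in the experiments.

```python
from collections import defaultdict

def mle_conditionals(pair_counts):
    """Map {(x, y): C(x, y)} to {x: {y: P_MLE(y|x)}}, storing nonzero entries only."""
    noun_totals = defaultdict(float)          # C(x) = sum over y of C(x, y)
    for (x, _y), count in pair_counts.items():
        noun_totals[x] += count
    dists = defaultdict(dict)
    for (x, y), count in pair_counts.items():
        if count > 0:
            dists[x][y] = count / noun_totals[x]
    return dict(dists)

# Hypothetical (noun, verb) counts, for illustration only.
toy_counts = {("gun", "fire"): 3, ("gun", "load"): 1, ("employee", "fire"): 2}
p_mle = mle_conditionals(toy_counts)
```

Storing only the nonzero entries keeps each noun's distribution sparse, which matters later when divergences are summed over verbs.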

To cluster nouns $x$ according to their conditional verb distributions $P_{MLE}(\cdot|x)$, we need a
measure of similarity between distributions. We use for this purpose the KL divergence from section
2.3.1:

$$D(q \| r) = \sum_{y \in \mathcal{Y}} q(y) \log \frac{q(y)}{r(y)}.$$

The KL divergence is a natural choice for a variety of reasons, most of which we have already
discussed in section 2.3.1. As mentioned there, $D(q \| r)$ measures how inefficient on average it
would be to use a code based on $r$ to encode a variable distributed according to $q$. With respect to
our problem, $D(P_{MLE}(\cdot|x) \| \tilde{P}(\cdot|c))$ thus gives us the loss of information in using
the centroid distribution $\tilde{P}(\cdot|c)$ instead of the empirical distribution
$P_{MLE}(\cdot|x)$ when modeling noun $x$. Furthermore, minimizing the KL divergence yields cluster
centroids that are a simple weighted average of member distributions, as we shall see.

One technical difficulty is that $D(q \| r)$ is infinite when $r(y) = 0$ but $q(y) > 0$. Due to sparse
data problems, it is often the case that $P_{MLE}(y|x)$ is zero for a particular pair $(x, y)$. We could
sidestep this problem by smoothing zero frequencies, perhaps using one of the methods described in
section 2.2. However, this is not very satisfactory because one of the goals of our work is precisely
to avoid data sparsity problems by grouping words into classes. As it turns out, the difficulty is
avoided by our clustering technique: instead of computing the KL divergence between individual
word distributions, we only calculate divergences between word distributions and cluster centroids.
Since centroids are average distributions, they are guaranteed to be nonzero whenever the word
distributions are. This is a useful advantage of our method over techniques that need to compare
pairs of individual objects, since estimates for individual objects are prone to inaccuracies due to
data sparseness.
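The zero-frequency difficulty can be seen in a few lines of code. This is a minimal sketch, assuming distributions are stored as dictionaries from verbs to probabilities; the function name `kl_divergence` and the toy distributions are illustrative and do not come from the text.

```python
import math

def kl_divergence(q, r):
    """D(q || r) in nats; returns infinity if r(y) = 0 while q(y) > 0."""
    total = 0.0
    for y, qy in q.items():
        if qy == 0.0:
            continue                      # 0 log 0 is taken to be 0
        ry = r.get(y, 0.0)
        if ry == 0.0:
            return math.inf
        total += qy * math.log(qy / ry)
    return total

# Hypothetical noun distributions over verbs.
gun = {"fire": 0.75, "load": 0.25}
employee = {"fire": 1.0}
# An equally weighted average of the two nouns, standing in for a centroid.
average = {"fire": 0.875, "load": 0.125}

direct = kl_divergence(gun, employee)     # infinite: "employee" never occurs with "load"
to_average = kl_divergence(gun, average)  # finite: the average covers both supports
```

Comparing the two nouns directly gives an infinite divergence, while the divergence from either noun to their average is finite, which is the property the clustering method relies on.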

The organization of the rest of this chapter is as follows. We develop the theoretical basis for our
clustering algorithm in section 3.3. We present some example clusterings in section 3.4 in order to
get a sense of the qualitative performance of our algorithm. Section 3.5 presents two evaluations of
the ability of our cluster-based probability estimation method to estimate word pair probabilities,
especially in situations where data is sparse; we show that indeed, our method does a good job of
modeling. Finally, we review other work in the NLP community on clustering words, and briefly touch
upon soft clustering methods from other fields.



3.3 Theoretical Basis 



Our general problem can be seen as that of learning the joint distribution $P(x, y)$ of pairs in
$\mathcal{X} \times \mathcal{Y}$ from a large sample. The training data is a sample $S$ of $n$
independently drawn pairs

$$S_i = (x_i, y_i), \quad 1 \le i \le n.$$

We assume that each $x \in \mathcal{X}$ and $y \in \mathcal{Y}$ occurs in the sample at least once (we
cannot train a model for $x$ or $y$ if we have no information about them).

The line of argument in this section proceeds as follows. We first set up the general form of 
our cluster-based probability model. We determine the two principles, minimum distortion and 
maximum entropy, that guide our search for the proper parameter settings for the model, and 
combine these two principles into the free energy function. Sections 3.3.1 and 3.3.2 go into the 
details of how we set the parameters by maximizing entropy and minimizing distortion. Finally, 
section 3.3.3 describes how searching for phase transitions of the free energy yields a hierarchical
clustering.

In order to estimate the likelihood of the sample, we need a probability model $\tilde{P}(x, y) =
\tilde{P}(x)\tilde{P}(y|x)$. We would like to find a set of clusters $\mathcal{C}$, each represented
by a cluster centroid $c$, such that each conditional distribution $\tilde{P}(y|x)$ can be decomposed
as

$$\tilde{P}(y|x) = \sum_{c \in \mathcal{C}} \tilde{P}(c|x)\tilde{P}(y|c). \tag{3.1}$$

$\tilde{P}(c|x)$ is the membership probability that $x$ belongs to $c$, and $\tilde{P}(y|c)$ is $y$'s
probability according to the centroid distribution for $c$: as stated above, centroids are
representative objects, and so form a distribution over $\mathcal{Y}$ just like objects do. Ideally,
the objects that belong most strongly to a given cluster would be similar to one another.



According to equation (3.1), then, we estimate the probability of $y$ given $x$ by taking an average
of the centroid distributions, weighting each $\tilde{P}(y|c)$ by the probability that $x$ belongs to
$c$. We thus make a Markovian assumption that the association of $x$ and $y$ is made solely through
the clusters, that is, that $y$ is conditionally independent of $x$ given $c$. The cluster model
drastically reduces the dimension of the model space, since the number of $(c, x)$ and $(c, y)$ pairs
should be much lower than the number of possible $(x, y)$ pairs.
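Equation (3.1) amounts to a weighted sum over clusters. The sketch below assumes memberships and centroids are already given as dictionaries; the function name `cluster_estimate`, the cluster labels, and all numbers are hypothetical values chosen for illustration.

```python
def cluster_estimate(membership_x, centroids, y):
    """Equation (3.1): sum over clusters c of P(c|x) * P(y|c)."""
    return sum(p_c * centroids[c].get(y, 0.0) for c, p_c in membership_x.items())

# Hypothetical centroid distributions over verbs.
centroids = {
    "weapon": {"fire": 0.5, "load": 0.5},
    "staff": {"fire": 0.25, "hire": 0.75},
}
# Hypothetical membership probabilities for a single noun.
rocket_membership = {"weapon": 0.8, "staff": 0.2}

p_fire = cluster_estimate(rocket_membership, centroids, "fire")
p_hire = cluster_estimate(rocket_membership, centroids, "hire")
```

Even if the pair (rocket, hire) never occurred in training, the estimate for it is nonzero because the noun has some membership in a cluster whose centroid gives weight to that verb.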



Given the decomposition of $\tilde{P}(y|x)$ in equation (3.1), we can write the likelihood assigned by
our model to a pair as

$$\tilde{P}(x, y) = \sum_{c \in \mathcal{C}} \tilde{P}(x)\tilde{P}(c|x)\tilde{P}(y|c). \tag{3.2}$$

We will assume that the marginals for $x \in \mathcal{X}$ are not part of our model and so can be
considered fixed; to indicate this, we will write $P(x)$ instead of $\tilde{P}(x)$. Without loss of
generality, we assume that $P(x)$ is greater than zero for all $x$. In order to flesh equation (3.2)
out, then, we need only find suitable forms for the cluster membership distributions $\tilde{P}(c|x)$
and centroid distributions $\tilde{P}(y|c)$. We will be guided by two principles: first, that our model
should fit the data well (otherwise, our model is not useful), and second, that our model should make
as few assumptions as possible (otherwise, our model is not general).



Goodness of fit is determined by the distortion of the model. Equation (3.1) estimates the
probability of $y$ given $x$ by randomly selecting a cluster $c$ according to distribution
$\tilde{P}(c|x)$, and then using $\tilde{P}(y|c)$ to estimate the (conditional) probability of $y$.
Recall from section 2.3.1 that $D(P_{MLE}(\cdot|x) \| \tilde{P}(\cdot|c))$ measures the inefficiency of
using $c$'s distribution rather than $x$'s maximum likelihood distribution to code for $x$. The
distortion $\mathcal{D}$ is the average coding loss incurred by our model:

$$\mathcal{D} = \sum_x P(x) \sum_c \tilde{P}(c|x)\, d(x, c), \tag{3.3}$$

where $d(x, c)$ is notational shorthand for $D(P_{MLE}(\cdot|x) \| \tilde{P}(\cdot|c))$.

As it turns out, the distortion equation does not give us enough information to find good closed-form
expressions for the membership probabilities. In fact, without any other constraints, the
cluster system that minimizes distortion is the one in which there is one centroid placed on top of
each object, with each object belonging only to the centroid it coincides with. Therefore, we add
the requirement that the membership assignments make the fewest assumptions possible, that is,
that the probability that an object belongs to a centroid should not be any higher than it needs
to be. This requirement corresponds to the maximum entropy principle, described in section 2.3.1.
Therefore, we wish to maximize the configuration entropy

$$H = -\sum_x P(x) \sum_c \tilde{P}(c|x) \log \tilde{P}(c|x), \tag{3.4}$$

which is the average entropy of the membership probabilities.

We can combine distortion and entropy into a single function, the free energy, which appears in
work on statistical mechanics (Rose, Gurewitz, and Fox, 1990):

$$F = \mathcal{D} - H/\beta. \tag{3.5}$$

This function is not arbitrary; indeed, at maximum entropy points (see section 3.3.1), we can show
that

$$H = -\frac{\partial F}{\partial T} \tag{3.6}$$

and

$$\mathcal{D} = \frac{\partial (\beta F)}{\partial \beta}, \tag{3.8}$$

where $T = 1/\beta$. The minima of $F$ are of special interest to us, since such points represent a
balance between the "disordering" force of maximizing entropy and the "ordering" force of minimizing
distortion. In fact, in statistical mechanics, the probability of finding a system in a given
configuration is a negative exponential in $F$, so the system is most likely to be found in its
minimal free energy configuration. $\beta$ is a free parameter whose interpretation we will leave for
later.
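To make the trade-off between distortion and entropy concrete, here is a minimal sketch that evaluates equations (3.3), (3.4) and (3.5) for a given configuration. It assumes the divergences $d(x, c)$ have already been computed; the function names, the nested-dictionary layout, and the two toy configurations are assumptions for illustration only.

```python
import math

def distortion(p_x, member, d):
    """Equation (3.3): average divergence between objects and their centroids."""
    return sum(p_x[x] * sum(member[x][c] * d[x][c] for c in member[x]) for x in p_x)

def config_entropy(p_x, member):
    """Equation (3.4): average entropy of the membership distributions."""
    return -sum(
        p_x[x] * sum(p * math.log(p) for p in member[x].values() if p > 0.0)
        for x in p_x
    )

def free_energy(p_x, member, d, beta):
    """Equation (3.5): F = D - H / beta."""
    return distortion(p_x, member, d) - config_entropy(p_x, member) / beta

# Hypothetical setup: two objects, two centroids, each object coincides with one centroid.
p_x = {"a": 0.5, "b": 0.5}
d = {"a": {"c1": 0.0, "c2": 1.0}, "b": {"c1": 1.0, "c2": 0.0}}
soft = {"a": {"c1": 0.5, "c2": 0.5}, "b": {"c1": 0.5, "c2": 0.5}}
hard = {"a": {"c1": 1.0, "c2": 0.0}, "b": {"c1": 0.0, "c2": 1.0}}
```

With these toy numbers, the uniform (soft) memberships give the lower free energy at $\beta = 1$, while the hard assignment wins at $\beta = 10$, which previews the role $\beta$ plays later in the chapter.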

Suppose we fix the number of clusters $|\mathcal{C}|$. Clearly, (local) minima of $F$ occur when the
entropy is at a (local) maximum and, simultaneously, the distortion is at a (local) minimum (although
critical points of $F$ need not correspond to critical points of $\mathcal{D}$ and $H$). However, it is
difficult to jointly maximize entropy and minimize distortion, since the location of cluster centroids
affects the membership probabilities, and vice versa; that is, the $\tilde{P}(y|c)$ and the
$\tilde{P}(c|x)$ are not independent. We therefore simplify the search for minima of $F$ by breaking up
the estimation process into two steps. First, we hold the distortion and centroid distributions fixed,
and maximize the entropy subject to these constraints. Since the distortion is regarded as constant in
this step, maximizing entropy corresponds to a reduction in free energy. Second, we fix the membership
probabilities at the values derived in the first step, and thus can treat the entropy as a constant.
We then find a critical point of $F$ with respect to the centroid distributions; it turns out that
this critical point is in fact a minimum of the distortion, and therefore free energy is reduced once
again. Moving the centroid distributions may change the values of the membership probabilities that
maximize entropy, though, and so we repeat these two steps until a stable configuration is reached.
This two-step estimation iteration is reminiscent of the EM (Expectation-Maximization) algorithm
(Dempster, Laird, and Rubin, 1977) commonly used to find maximum likelihood solutions.

Before we continue, we review the notation that will be used in the following sections. Model
probabilities are always marked with a tilde ($\tilde{P}$). The model parameters are the membership
probabilities $\tilde{P}(c|x)$ and the centroid distributions $\tilde{P}(y|c)$. The object marginal
probabilities $\tilde{P}(x) = P(x)$ are not considered part of the model, and so are regarded as
positive constants throughout.^1 The centroid marginals $\tilde{P}(c)$ are given by $\tilde{P}(c) =
\sum_x \tilde{P}(c|x)P(x)$; this form ensures that $\sum_x \tilde{P}(x|c) = 1$. Empirical frequency
distributions are denoted by $P_{MLE}$ and are considered fixed by the data. By assumption, for all
$y$ there exists an object $x$ such that $P_{MLE}(y|x) > 0$. The quantity $d(x, c)$ is shorthand for
the KL divergence $D(P_{MLE}(\cdot|x) \| \tilde{P}(\cdot|c))$. We summarize this information in table
3.1.

Quantity             Value                                            Notes

$\tilde{P}(c|x)$                                                      (to be determined)
$\tilde{P}(y|c)$                                                      (to be determined)
$\tilde{P}(x)$       $P(x)$                                           fixed at positive values
$\tilde{P}(c)$       $\sum_x \tilde{P}(c|x)P(x)$                      determined by $\tilde{P}(c|x)$
$\tilde{P}(x|c)$     $\tilde{P}(c|x)P(x)/\tilde{P}(c)$                determined by $\tilde{P}(c|x)$
$P_{MLE}(y|x)$       $C(x,y)/C(x)$                                    fixed by data; $\forall y, \exists x : P_{MLE}(y|x) > 0$
$d(x, c)$            $D(P_{MLE}(\cdot|x) \| \tilde{P}(\cdot|c))$      determined by $\tilde{P}(y|c)$

Table 3.1: Summary of common quantities

We will use natural logarithms in this chapter, so that the base of the logarithm function is $e$;
using another base would not substantially alter our results, but we would have extra constant
factors in most of our expressions. The next two subsections assume that the number of clusters has
been fixed.



3.3.1 Maximum-Entropy Cluster Membership 

This section addresses the first parameter estimation step of finding the cluster membership
probabilities $\tilde{P}(c|x)$ that maximize the configuration entropy, and hence reduce the free
energy, assuming that the distortion and the centroid distributions are fixed (it does not suffice
simply to hold the centroid distributions fixed, since we see from equation (3.3) that the distortion
depends on the membership probabilities, too). We will make the further assumption that for all
centroids $c$, $\tilde{P}(y|c) > 0$ for all $y$; this assumption is justified in the next section.

^1 Our implementation sets $P(x) = 1/|\mathcal{X}|$ instead of $P_{MLE}(x)$, since we are interested
in distributional modeling without regard to the frequencies of particular nouns.

Recall the definition of the configuration entropy $H$ from equation (3.4):

$$H = -\sum_x P(x) \sum_c \tilde{P}(c|x) \log \tilde{P}(c|x).$$

We wish to maximize this quantity subject to two constraints: the normalization constraint that
$\sum_c \tilde{P}(c|x) = 1$ for all $x$, and the distortion constraint that $\mathcal{D} = K$ for some
constant $K$. We therefore take the variation of the function $H^+$:

$$H^+ = H - \sum_x \alpha_x \left( \sum_c \tilde{P}(c|x) - 1 \right)
        - \beta \left( \sum_x P(x) \sum_c \tilde{P}(c|x)\, d(x, c) - K \right),$$

where $\alpha_x$ and $\beta$ are Lagrange multipliers. It is important to note that we are using
$\beta$ both here as a multiplier and as a normalization term in the free energy (3.5).

We now calculate the partial derivative of $H^+$ with respect to a given membership probability
$\tilde{P}(c|x)$, since fixing the centroid distributions means that the $\tilde{P}(c|x)$ are
independent (except for their association through the fixed distortion).

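$$\begin{aligned}
\frac{\partial H^+}{\partial \tilde{P}(c|x)}
  &= \frac{\partial H}{\partial \tilde{P}(c|x)} - \alpha_x - \beta P(x)\, d(x, c) \\
  &= -\left( \frac{P(x)\tilde{P}(c|x)}{\tilde{P}(c|x)} + P(x) \log \tilde{P}(c|x) \right)
     - \alpha_x - \beta P(x)\, d(x, c) \\
  &= -P(x) \left[ 1 + \log \tilde{P}(c|x) + \frac{\alpha_x}{P(x)} + \beta\, d(x, c) \right]
\end{aligned}$$

(there is no problem with division by $P(x)$ since we assumed all object marginals are positive).
At critical points of $H^+$ we have that $\partial H^+ / \partial \tilde{P}(c|x) = 0$. This allows us
to solve for $\tilde{P}(c|x)$:

$$\tilde{P}(c|x) = e^{-\beta d(x, c)}\, e^{-(1 + \alpha')},$$

where $\alpha' = \alpha_x / P(x)$. Since $\alpha'$ is meant to ensure the normalization of
$\tilde{P}(c|x)$, $\alpha'$ must be set to a value such that the following is satisfied:

$$e^{1 + \alpha'} = \sum_c e^{-\beta d(x, c)} \stackrel{\text{def}}{=} Z_x$$

($Z$ is standard notation for partition (normalization) functions; the name comes from the German
Zustandssumme). We therefore have a closed-form solution for the membership probabilities:

$$\tilde{P}(c|x) = \frac{e^{-\beta d(x, c)}}{Z_x}. \tag{3.9}$$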



It was shown by Jaynes (1983) that the exponential form (3.9) gives not just a critical point but the
maximum of the entropy, and so we have a maximum entropy estimate of membership probability, as
desired. The expression (3.9) is intuitively satisfying because it makes the membership probabilities
dependent on distance (in the KL divergence sense): the farther $x$ is from $c$, the less likely it is
that $x$ belongs to $c$. Furthermore, given that the centroid distributions were fixed at positive
values for all $y$, $d(x, c)$ is always defined, which means that all membership probabilities are
positive; each object has some degree of association with each cluster. For each $\tilde{P}(c|x)$, we
need to calculate $d(x, c)$, which is a sum over all $y \in \mathcal{Y}$, so the time to update all the
membership probabilities is $O(|\mathcal{X}||\mathcal{C}||\mathcal{Y}|)$. However, if the object
distributions are sparse, then the computation of $d(x, c)$ will be significantly faster.
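The closed form (3.9) can be sketched in a few lines. The function name `memberships_for` and the toy divergences are assumptions for illustration. Subtracting the smallest divergence before exponentiating is a numerical-stability device added here, not something taken from the text; it leaves the normalized result unchanged because the common factor cancels in the ratio.

```python
import math

def memberships_for(d_x, beta):
    """Equation (3.9) for one object x: d_x maps each cluster c to d(x, c)."""
    d_min = min(d_x.values())    # shift for numerical stability only
    weights = {c: math.exp(-beta * (d - d_min)) for c, d in d_x.items()}
    z = sum(weights.values())    # partition function Z_x, up to the shift
    return {c: w / z for c, w in weights.items()}

# Hypothetical divergences from one noun to two centroids.
d_x = {"c1": 0.1, "c2": 0.5}
flat = memberships_for(d_x, beta=0.0)      # no preference between clusters
mild = memberships_for(d_x, beta=1.0)
sharp = memberships_for(d_x, beta=100.0)   # nearly all mass on the closer centroid
```

At $\beta = 0$ the memberships are uniform, and as $\beta$ grows the object commits more and more firmly to its nearest centroid.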



There is a pleasing relationship between expression (3.9) for $\tilde{P}(c|x)$ and an estimate given by
theorem 2.2, restated here:

Theorem 2.2 Let $r$ be a hypothesized source distribution. The probability according to $r$ of
observing a sample of length $n$ with empirical frequency distribution $q$ is approximately
$b^{-nD(q \| r)}$, where $b$ is the base of the logarithm function.

Thus, the maximum entropy membership probability $\tilde{P}(c|x) = e^{-\beta d(x, c)}/Z_x$ corresponds
to the probability of observing object distribution $P_{MLE}(y|x)$ if the source distribution is
assumed to be the centroid $\tilde{P}(y|c)$, except that $\beta$ has replaced the sample size $n$.
Therefore, if we regard $\beta$ not as a Lagrange multiplier but as a free parameter, we can in some
sense control the sample size ourselves. If we use a high value of $\beta$, then we express strong
belief in the maximum likelihood estimate $P_{MLE}$ (as would be the case for a very large sample), so
that the probability that $x$ belongs to a centroid $c$ is negligible unless $d(x, c)$ is very small.
Conversely, a low value of $\beta$ is equivalent to a small sample, in which case we do not trust the
MLE and so allow $\tilde{P}(c|x)$ to be high even if $x$ is relatively distant from $c$. Section 3.3.3
describes how we vary $\beta$ in order to derive a hierarchical clustering.

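We conclude this section by observing that at the maximum entropy membership probabilities,
the free energy can be rewritten as follows:

$$\begin{aligned}
F &= \mathcal{D} - \frac{1}{\beta} H \\
  &= \sum_x \sum_c P(x)\tilde{P}(c|x) \left[ d(x, c) + \frac{1}{\beta} \log \tilde{P}(c|x) \right] \\
  &= \sum_x \sum_c P(x)\tilde{P}(c|x) \left[ d(x, c) - d(x, c) - \frac{1}{\beta} \log Z_x \right]
     \quad \text{(substitution of (3.9))} \\
  &= -\frac{1}{\beta} \sum_x P(x) \log Z_x \sum_c \tilde{P}(c|x) \\
  &= -\frac{1}{\beta} \sum_x P(x) \log Z_x,
\end{aligned} \tag{3.10}$$

where the last step is justified since we ensured the normalization of the maximum-entropy
membership probabilities. By simple differentiation, it is easy to see that if we set $T = 1/\beta$,
then $\partial F / \partial T = \beta F - \beta \mathcal{D} = -H$ (the last equality follows from
$F = \mathcal{D} - H/\beta$), and $\partial (\beta F) / \partial \beta = \mathcal{D}$. This gives us
equations (3.6) and (3.8), as desired.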



3.3.2 Minimum-Distortion Cluster Centroids 



We now proceed with the second estimation step. We fix the membership probabilities $\tilde{P}(c|x)$
at their maximum entropy values, calculated above, so that the configuration entropy can now be
considered a constant, and the expression for the free energy is given by equation (3.10).

Now that the membership probabilities have been fixed, the individual centroid distributions are
all independent and we just need to find values for them that minimize $F$, subject to the constraint
that $\sum_y \tilde{P}(y|c) = 1$ for all centroids. What we will do is first find a critical point of
$F$ (equations (3.11) through (3.14)), and then prove in lemma 3.1 that this critical point is in fact
a minimum by showing that it minimizes the distortion $\mathcal{D}$.

In order to find a critical point of $F$, we take partial derivatives of

$$F^+ = -\frac{1}{\beta} \sum_x P(x) \log Z_x - \sum_c \gamma_c \left( \sum_y \tilde{P}(y|c) - 1 \right),$$

where $\gamma_c$ is yet another Lagrange multiplier and we use expression (3.10) for $F$.

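The partial derivative of $F^+$ with respect to a given $\tilde{P}(y|c)$ is calculated as follows:

$$\begin{aligned}
\frac{\partial F^+}{\partial \tilde{P}(y|c)}
  &= -\frac{1}{\beta} \sum_x \frac{P(x)}{Z_x} \frac{\partial Z_x}{\partial \tilde{P}(y|c)} - \gamma_c \\
  &= \sum_x P(x)\tilde{P}(c|x) \frac{\partial d(x, c)}{\partial \tilde{P}(y|c)} - \gamma_c.
\end{aligned} \tag{3.11}$$

The variation of $d(x, c)$ with respect to $\tilde{P}(y|c)$ is

$$\frac{\partial d(x, c)}{\partial \tilde{P}(y|c)}
  = \frac{\partial}{\partial \tilde{P}(y|c)} \sum_{y'} P_{MLE}(y'|x) \log \frac{P_{MLE}(y'|x)}{\tilde{P}(y'|c)}
  = -\frac{P_{MLE}(y|x)}{\tilde{P}(y|c)}, \tag{3.12}$$

so the centroid distribution term $\tilde{P}(y|c)$ reappears. Substituting (3.12) into (3.11), we have

$$\begin{aligned}
\frac{\partial F^+}{\partial \tilde{P}(y|c)}
  &= \sum_x P(x)\tilde{P}(c|x) \left( -\frac{P_{MLE}(y|x)}{\tilde{P}(y|c)} \right) - \gamma_c \\
  &= -\frac{1}{\tilde{P}(y|c)} \sum_x P(x)\tilde{P}(c|x) P_{MLE}(y|x) - \gamma_c.
\end{aligned}$$

At a critical point of $F^+$, the partial derivative must be 0, which allows us to solve for
$\tilde{P}(y|c)$:

$$\tilde{P}(y|c) = -\frac{1}{\gamma_c} \sum_x P(x)\tilde{P}(c|x) P_{MLE}(y|x). \tag{3.13}$$

The multiplier $\gamma_c$ is meant to enforce the constraint that $\sum_y \tilde{P}(y|c) = 1$, so

$$\begin{aligned}
1 &= \sum_y -\frac{1}{\gamma_c} \sum_x P(x)\tilde{P}(c|x) P_{MLE}(y|x) \\
  &= -\frac{1}{\gamma_c} \sum_x P(x)\tilde{P}(c|x).
\end{aligned}$$

Therefore, $\gamma_c = -\sum_x P(x)\tilde{P}(c|x) = -\tilde{P}(c)$; upon substitution of this into
(3.13), we finally obtain the centroid distributions:

$$\tilde{P}(y|c) = \sum_x \tilde{P}(x|c) P_{MLE}(y|x). \tag{3.14}$$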

We thus have a natural expression for a cluster centroid $c$: it is an average over all data points
$x$, weighted by the Bayes inverse of the probability that $x$ belongs to $c$. The Bayes inverses are
all positive since the maximum-entropy membership probabilities are, so the centroid distribution
cannot be zero for any $y$ since we assume that $P_{MLE}(y|x)$ is nonzero for at least one $x$. It is
clear that the time required to update all the centroid distributions is
$O(|\mathcal{Y}||\mathcal{C}||\mathcal{X}|)$ in the worst case; again, however, the computation is
much faster if the object distributions are sparse.
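The centroid update (3.14) can be sketched as follows, assuming memberships and empirical distributions are stored as nested dictionaries. The function name `update_centroids` and the toy numbers are hypothetical. The inner loop visits only the nonzero entries of each object distribution, which is one way to exploit the sparsity mentioned above.

```python
def update_centroids(p_x, member, p_mle):
    """Equation (3.14): P(y|c) = sum over x of P(x|c) * P_MLE(y|x)."""
    clusters = list(next(iter(member.values())))
    centroids = {}
    for c in clusters:
        p_c = sum(member[x][c] * p_x[x] for x in p_x)      # centroid marginal P(c)
        centroid = {}
        for x in p_x:
            bayes_inverse = member[x][c] * p_x[x] / p_c    # P(x|c)
            for y, p in p_mle[x].items():                  # nonzero entries only
                centroid[y] = centroid.get(y, 0.0) + bayes_inverse * p
        centroids[c] = centroid
    return centroids

# Hypothetical inputs: two nouns, two clusters, uniform object marginals.
p_x = {"gun": 0.5, "employee": 0.5}
p_mle = {"gun": {"fire": 0.75, "load": 0.25}, "employee": {"fire": 1.0}}
member = {"gun": {"c1": 0.8, "c2": 0.2}, "employee": {"c1": 0.2, "c2": 0.8}}
centroids = update_centroids(p_x, member, p_mle)
```

Because every membership probability is positive, each centroid gives nonzero probability to every verb that some noun takes, so the divergences $d(x, c)$ used in the next membership update stay finite.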

Now, expression (3.14) gives us the unique critical point of $F$ when the entropy is held fixed;
but is the free energy actually reduced at this point? Our goal, after all, is to look for minima of
$F$. Since the entropy was held fixed, it suffices to show that the centroid distributions (3.14)
yield a minimum of the distortion, which we do in the following lemma.

Lemma 3.1 If the distortion $\mathcal{D}$ has exactly one critical point with respect to centroid
distributions $\tilde{P}(y|c)$, $0 < \tilde{P}(y|c) < 1$, then that critical point is the unique
minimum of $\mathcal{D}$, assuming the cluster membership probabilities $\tilde{P}(c|x)$ are fixed.

Proof. Since

$$\begin{aligned}
\mathcal{D} &= \sum_x P(x) \sum_c \tilde{P}(c|x)\, d(x, c) \\
  &= \sum_x P(x) \sum_c \tilde{P}(c|x) \sum_y P_{MLE}(y|x) \log P_{MLE}(y|x) \\
  &\quad - \sum_x P(x) \sum_c \tilde{P}(c|x) \sum_y P_{MLE}(y|x) \log \tilde{P}(y|c),
\end{aligned}$$

and the centroid distributions are independent when the membership probabilities are fixed, it is
sufficient to maximize for each centroid $c$ the quantity

$$\begin{aligned}
\mathcal{D}_c &= \sum_x P(x)\tilde{P}(c|x) \sum_y P_{MLE}(y|x) \log \tilde{P}(y|c) \\
  &= \sum_y \left( \log \tilde{P}(y|c) \right) \sum_x P(x)\tilde{P}(c|x) P_{MLE}(y|x) \\
  &= \sum_y \log \left( \tilde{P}(y|c)^{Q(c, y)} \right),
\end{aligned}$$

where $Q(c, y) = \sum_x P(x)\tilde{P}(c|x) P_{MLE}(y|x)$ does not depend on $\tilde{P}(y|c)$. But since
the logarithm is a strictly increasing function, we need only find a maximum of the product

$$\prod_y \tilde{P}(y|c)^{Q(c, y)}. \tag{3.15}$$

Observe that this product (unlike the logarithm, which is why we had to do all this equation
rewriting) is continuous on the domain $\{\tilde{P}(y|c) : 0 \le \tilde{P}(y|c) \le 1\}$, which is
closed and bounded. Therefore, we know from analysis that (3.15) achieves both its maximum and its
minimum on its domain. Since clearly every point on the boundary of the domain yields a minimum value
(zero), the unique critical point must be the maximum of (3.15) and thus the minimum of $\mathcal{D}$.

Now, since the fixed membership probabilities determine the entropy, any critical point of $F$
must also be a critical point of the distortion because $dF = d\mathcal{D}$ if $H$ is a constant.
Therefore, the centroid distributions (3.14) define the unique critical point of the distortion, and
application of lemma 3.1 tells us that this is indeed the minimum of $\mathcal{D}$. Thus, we have
succeeded in finding centroid distributions which minimize distortion and therefore reduce the free
energy.

3.3.3 Hierarchical Clustering 

In the previous two sections, we developed maximum entropy estimates for membership probabilities
and minimum distortion estimates for centroid distributions:

$$\tilde{P}(c|x) = \exp(-\beta\, d(x, c)) / Z_x \quad (3.9), \text{ and}$$

$$\tilde{P}(y|c) = \sum_x \tilde{P}(x|c) P_{MLE}(y|x) \quad (3.14).$$

Our search for minima of $F$ at a fixed $\beta$ is a two-step iteration described in section 3.3.
First, we set the membership probabilities at their maximum entropy values (3.9), using the current
centroid distributions. Then, we plug these membership probabilities into (3.14) to update the centroid
distributions. We repeat this two-step cycle until the parameters converge to steady states.

Now, this two-step iteration lets us find cluster centroids and membership probabilities for a
fixed number of clusters. However, we have not yet shown how the number of clusters is chosen.
The inclusion of the parameter $\beta$ in the free energy expression

$$F = \mathcal{D} - H/\beta$$

suggests the use of a deterministic annealing procedure for clustering (Rose, Gurewitz, and Fox,
1990), in which the number of clusters is determined through a sequence of phase transitions by
continuously increasing $\beta$ according to an annealing schedule.

As discussed in section 3.3.1, $\beta$ plays a role similar to sample size and thus controls the
importance of the distance function $d(x, c)$. However, it will now be fruitful to think of $\beta$ as
the inverse of temperature. At the high temperature limit (low $\beta$), the entropy $H$ has the
biggest role in minimizing the free energy, so a system consisting of only one cluster centroid is
preferred. At the low temperature limit (high $\beta$), the distortion dominates and the minimum-energy
configuration is then the one where we have one centroid placed on top of every data point, with each
data point belonging with probability one to the centroid it coincides with. Thus, the system has
"cooled down" to the point where the freedom of objects to associate with distant centroids has
disappeared. Between these two extremes, there must be critical values of $\beta$ at which phase
transitions occur; that is, when the natural solution involves including more centroids.

We find these phase transitions by taking a cluster $c$ and a twin $c^*$ of $c$ such that the centroid
$\tilde{P}(\cdot|c^*)$ is a small random perturbation of $\tilde{P}(\cdot|c)$. Below the critical
$\beta$ at which $c$ splits, the membership and centroid iterative reestimation procedure will make
$\tilde{P}(\cdot|c)$ and $\tilde{P}(\cdot|c^*)$ converge, from which we infer that $c$ and $c^*$ are
really the same cluster. But if $\beta$ is above the critical value for $c$, the two centroids will
diverge, giving rise to two children of $c$.

A sketch of our clustering procedure appears in figure 3.2. We start with very low $\beta$ and a single
cluster whose centroid is the average of all noun distributions (and so is guaranteed to be nonzero for
all $y$). For any given $\beta$, we have a current set of leaf clusters corresponding to the current
free energy minimum. To refine such a solution, we search for the lowest $\beta$ that causes some leaf
cluster to split. Ideally, there is just one split at that critical value, but for practical
performance and numerical accuracy reasons we may have several splits at the new critical point. The
splitting procedure can then be repeated to achieve the desired number of clusters or model
cross-entropy.



create initial centroid
REPEAT until $\beta \ge \beta_{max}$ or enough clusters:
    For each centroid c, create twin c*
    REPEAT until twins (c, c*) split or too many iterations:
        Estimate membership probs by (3.9)
        Estimate centroids by (3.14)
    IF more than one centroid split
    THEN [raised $\beta$ too quickly]
        lower $\beta$
    ELSE IF no centroid split
        raise $\beta$
    ELSE [one centroid split]
        raise $\beta$
    delete extra twins c*

Figure 3.2: Clustering algorithm
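The procedure of figure 3.2 can be sketched in runnable form as below. This is a simplified illustration under stated assumptions, not the implementation used for the experiments: $\beta$ is always raised geometrically (the "lower $\beta$" branch is omitted), a split is declared when the KL divergence between twins exceeds a tolerance, and all names, default parameters, and the toy data are hypothetical.

```python
import math
import random

def kl(q, r):
    """D(q || r) in nats; infinite if r lacks support where q has mass."""
    total = 0.0
    for y, qy in q.items():
        if qy == 0.0:
            continue
        ry = r.get(y, 0.0)
        if ry == 0.0:
            return math.inf
        total += qy * math.log(qy / ry)
    return total

def reestimate(p_mle, p_x, centroids, beta):
    """One pass of the two-step iteration: memberships (3.9), then centroids (3.14)."""
    member = {}
    for x, q in p_mle.items():
        d = [kl(q, cen) for cen in centroids]
        d_min = min(d)                                  # shift for numerical stability only
        w = [math.exp(-beta * (di - d_min)) for di in d]
        z = sum(w)
        member[x] = [wi / z for wi in w]
    updated = []
    for c, old in enumerate(centroids):
        p_c = sum(member[x][c] * p_x[x] for x in p_mle)
        cen = dict.fromkeys(old, 0.0)
        for x, q in p_mle.items():
            weight = member[x][c] * p_x[x] / p_c        # Bayes inverse P(x|c)
            for y, qy in q.items():
                cen[y] += weight * qy
        updated.append(cen)
    return updated

def perturb(cen, rng, eps=0.01):
    """Create a twin: a small random multiplicative perturbation, renormalized."""
    noisy = {y: p * (1.0 + eps * rng.random()) for y, p in cen.items()}
    z = sum(noisy.values())
    return {y: p / z for y, p in noisy.items()}

def anneal(p_mle, p_x, beta=0.1, rate=1.5, beta_max=50.0,
           max_clusters=2, iters=200, tol=1e-3, seed=0):
    rng = random.Random(seed)
    verbs = sorted({y for q in p_mle.values() for y in q})
    # Initial centroid: the average of all noun distributions.
    centroids = [{y: sum(p_x[x] * p_mle[x].get(y, 0.0) for x in p_mle) for y in verbs}]
    while beta < beta_max and len(centroids) < max_clusters:
        trial = [t for cen in centroids for t in (cen, perturb(cen, rng))]
        for _ in range(iters):
            trial = reestimate(p_mle, p_x, trial, beta)
        centroids = []
        for i in range(0, len(trial), 2):
            centroids.append(trial[i])
            if kl(trial[i], trial[i + 1]) > tol:        # twins diverged: keep both
                centroids.append(trial[i + 1])
        beta *= rate                                    # simplified schedule: always raise
    return centroids

# Hypothetical noun distributions: two weapon-like nouns and two personnel-like nouns.
toy = {
    "gun":      {"fire": 0.4, "load": 0.4, "launch": 0.2},
    "missile":  {"fire": 0.4, "load": 0.2, "launch": 0.4},
    "employee": {"fire": 0.4, "hire": 0.4, "promote": 0.2},
    "manager":  {"fire": 0.4, "hire": 0.2, "promote": 0.4},
}
uniform = {x: 1.0 / len(toy) for x in toy}
clusters = anneal(toy, uniform)
nearest = {x: min(range(len(clusters)), key=lambda c: kl(q, clusters[c]))
           for x, q in toy.items()}
```

On this toy data the single initial centroid stays stable at low $\beta$ and separates into a weapon-like and a personnel-like cluster once $\beta$ passes the first critical value. Lowering $\beta$ when several centroids split at once, as in figure 3.2, would need an extra branch that this sketch leaves out.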
Figure 3.3: Direct object clusters for verb fire



3.4 Clustering Examples

    The properties that the child can detect in the input - such as the serial positions
    and adjacency and co-occurrence relations among words - are in general linguistically
    irrelevant. (Pinker, 1984, pg. 50)

In this section, we describe experiments with clustering words using the procedure described in
the previous section. As explained there, our clustering procedure yields for each value of $\beta$ a
set $\mathcal{C}_\beta$ of clusters minimizing the free energy $F$, with the model estimate for the
conditional probability of a verb $y$ given a noun $x$ being

$$\tilde{P}(y|x) = \sum_{c \in \mathcal{C}_\beta} \tilde{P}(c|x)\tilde{P}(y|c),$$

where $\tilde{P}(c|x)$ depends on $\beta$. Recall that the pair $(x, y)$ means that $x$ occurred as the
head noun of the direct object of verb $y$; for example, the pair (thesis, write) might be extracted
from the sentence "You should write your thesis".

In our first experiment, we wanted to choose a small set of nouns that we could be sure bore 
some relation to one another. Therefore, we chose the set X to consist of the 64 nouns appearing 
most frequently as heads of direct objects of the verb "fire" in the Associated Press newswire for 
1988. In this corpus, the chosen nouns appeared as direct object heads of a total of 2147 distinct 
verbs, so each noun was represented by a density over 2147 verbs. 

Figure 3.3 shows the five words most similar to the cluster centroid for the four clusters resulting
from the first two cluster splits, along with the KL divergences from the centroids. It can be seen
that the first split separates the objects corresponding to the weaponry sense of "fire" (cluster 1)
from the ones corresponding to the personnel action (cluster 2). The second split then further refines
the weaponry sense into a projectile sense (cluster 4) and a projector (of projectiles) sense (cluster
3). That split is somewhat less sharp, perhaps because not enough distinguishing contexts occur
in the corpus. Notice that "rocket" is close to both centroids 3 and 4 and therefore has a high
probability of belonging to both classes: our "soft" clustering scheme allows this type of ambiguity.
Note that the "senses" we refer to are our own designations for the clusters - the algorithm does
not decide what the sense(s) of a cluster actually are.

Figure 3.4: Noun clusters for Grolier's encyclopedia



Our second experiment was performed on a bigger data set: we used object-verb pairs involving 
the 1000 most frequent nouns in the June 1991 electronic version of Grolier's Encyclopedia (10 
million words). Figure 3.4 shows the four closest nouns for each centroid in a set of hierarchical 
clusters derived from this corpus. Again, we notice that the clusters and cluster splits often seem 
to correspond to natural sense distinctions. We also observe that a general word like "number" is 
close to quite a few cluster centroids. 



3.5 Model Evaluation 

The preceding qualitative discussion provides some indication of what aspects of distributional re- 
lationships may be discovered by clustering. However, we also need to evaluate clustering more 
rigorously as a basis for models of distributional relationships. We now look at two kinds of mea- 
surements of model quality: (i) KL divergence between held-out data and the asymmetric model, 
and (ii) performance on the task of deciding which of two verbs is more likely to take a given noun 
as direct object when the data relating one of the verbs to the noun has been withheld from the 
training data. 

The evaluation described below was performed on a data set extracted from 44 million words of 
1988 Associated Press newswire by using the pattern-matching techniques mentioned earlier. This 
collection process yielded 1112041 verb-object pairs. We then selected the subset involving the 
1000 most frequent nouns in the corpus for clustering, and randomly divided it into a training set of 
756721 pairs and a test set of 81240 pairs. Figure 3.5 shows the closest nouns to the cluster centroids
in an early stage of the hierarchical clustering of the training data.

3.5.1 KL Divergence



Figure 3.6 plots the aggregate KL divergence of several data sets to cluster models of different sizes; 
the higher the KL divergence, the worse the coding inefficiency of using the cluster model. The 
aggregate KL divergence is given by 

J2d{Pmle{-\x)\\P{-\x)). 

X 

For each critical value of we show the aggregate KL divergence with respect to the cluster model 
based on Cp for three sets: the training set (set train), a randomly selected held-out test set (set 
test), and a set of held-out data for a further 1000 nouns that were not clustered (set new). 

Not surprisingly, the training set aggregate divergence decreases monotonically. The test set 
aggregate divergence decreases to a minimum at 206 clusters and then starts increasing, which 
suggests that the larger models are overtrained. 

The new noun test set is intended to evaluate whether clusters based on the 1000 most frequent 
nouns are useful classifiers for the selectional properties of nouns in general. We characterize each 
new noun x by its maximum likelihood distribution $P_{MLE}(\cdot|x)$ as estimated from the new sample (we 
can't use the training data since the new nouns by definition don't appear there). The corresponding 
cluster membership probabilities for a new noun then have the form 

$$P(c|x) = \exp\bigl(-\beta D(P_{MLE}(\cdot|x) \,\|\, P(\cdot|c))\bigr) / Z_x$$

and the model probability estimate is calculated as before. As the figure shows, the cluster model 
provides over one nat of information about the selectional properties of the new nouns, although the 
overtraining effect is even more pronounced than for the held-out data involving the 1000 clustered 
nouns. 
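
The treatment of a new noun can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the centroids, the value of beta, and the function names `membership` and `model_estimate` are hypothetical, and every centroid is assumed to give nonzero probability to each verb seen with the new noun.

```python
import math

def kl_divergence(p, q):
    """Return D(p || q) in nats for distributions given as dicts."""
    return sum(p_y * math.log(p_y / q[y]) for y, p_y in p.items() if p_y > 0.0)

def membership(p_x, centroids, beta):
    """P(c|x) proportional to exp(-beta * D(P_MLE(.|x) || P(.|c)))."""
    scores = {c: math.exp(-beta * kl_divergence(p_x, p_c))
              for c, p_c in centroids.items()}
    z_x = sum(scores.values())  # normalization constant Z_x
    return {c: s / z_x for c, s in scores.items()}

def model_estimate(y, p_x, centroids, beta):
    """P(y|x) = sum over clusters c of P(c|x) * P(y|c)."""
    member = membership(p_x, centroids, beta)
    return sum(member[c] * centroids[c].get(y, 0.0) for c in centroids)

# Toy centroids (hypothetical) over three verbs.
centroids = {
    "c1": {"drink": 0.7, "pour": 0.2, "read": 0.1},
    "c2": {"drink": 0.1, "pour": 0.1, "read": 0.8},
}
p_new = {"drink": 0.5, "pour": 0.5}  # MLE distribution of a new noun
member = membership(p_new, centroids, beta=2.0)
print(member, model_estimate("read", p_new, centroids, beta=2.0))
```

Note that the model assigns some probability to "read" even though the new noun was never observed with that verb.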
Figure 3.5: Noun clusters for 1988 Associated Press newswire 
Figure 3.6: Model evaluation, 1988 Associated Press object-verb pairs 
Figure 3.7: Pairwise verb comparisons, 1988 Associated Press object-verb pairs 

3.5.2 Decision Task 



We also evaluated our cluster models on a verb decision task related to applications in disambiguation 
in language analysis. The task consists of judging which of two verbs y and y' is more likely to take 
a given noun x as object when all occurrences of (x, y) in the training set were deliberately deleted. 
Thus, this test evaluates how well the models reconstruct missing data from the cluster centroids, 
since we are interested in cluster models that can help solve sparse data problems. 

The data for this test was built from the training data for the previous one in the following way, 
based on an experiment by Dagan, Marcus, and Markovitch (1995). We randomly picked 104 object-verb 
pairs (x, y) such that verb y appeared fairly frequently (between 500 and 5000 occurrences), 
and deleted all occurrences of such pairs from the training set. The resulting training set was used 
to build a sequence of cluster models as before. To create the test set, for each verb y in a deleted 
pair, a confusion set {y, y'} was created. Then, each model was presented with the triple (y, x, y'), 
and was asked to decide which of y and y' is more likely to appear with a noun x. 

Of course, we need some way of judging correctness without having access to the true pair 
probabilities, since the source distribution for natural language is presumably unknown. We fall 
back on the empirical frequencies to give us a rough estimate of the correct answer. Since these 
frequencies are known not to be entirely accurate (otherwise, we would have no need of cluster 
models!), we choose to create confusion sets for a noun x out of pairs of verbs y and y' such that 
one of the verbs occurred at least twice as often with x as the other in the original data set (prior 
to the pair deletion). Thus, we can be reasonably sure that whichever verb occurred with x more 
often in the training set truly has a higher probability of co-occurrence. 

In order to evaluate performance, we compare the sign of $\log(P(y|x)/P(y'|x))$ with that of 
$\log(P_{MLE}(y|x)/P_{MLE}(y'|x))$ on the initial data set. The error rate for each model is simply the 
proportion of sign disagreements over the test corpus. Figure 3.7 shows the error rates for each model 
on all the selected (y, x, y') (all) and for just those exceptional triples in which the log frequency 
ratio of (x, y) and (x, y') differs from the log marginal frequency ratio of y and y'. 
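
The error-rate computation can be sketched in Python as below. The probability tables and the name `error_rate` are assumptions made for illustration. Since the sign of log(a/b) equals the sign of a - b for positive a and b, the sketch compares probabilities directly, which also avoids taking the logarithm of zero.

```python
def sign(value):
    """Return -1, 0, or 1 according to the sign of `value`."""
    return (value > 0) - (value < 0)

def error_rate(triples, model_prob, mle_prob):
    """Fraction of triples (y, x, y_alt) where model and MLE preferences disagree."""
    errors = 0
    for y, x, y_alt in triples:
        model_sign = sign(model_prob[(y, x)] - model_prob[(y_alt, x)])
        mle_sign = sign(mle_prob[(y, x)] - mle_prob[(y_alt, x)])
        if model_sign != mle_sign:
            errors += 1
    return errors / len(triples)

# Toy tables (hypothetical), keyed by (verb, noun) and holding P(verb | noun).
model_prob = {("eat", "apple"): 0.3, ("drive", "apple"): 0.1,
              ("read", "car"): 0.2, ("drive", "car"): 0.1}
mle_prob = {("eat", "apple"): 0.4, ("drive", "apple"): 0.05,
            ("read", "car"): 0.01, ("drive", "car"): 0.3}
triples = [("eat", "apple", "drive"), ("read", "car", "drive")]
print(error_rate(triples, model_prob, mle_prob))
```

In this toy case the model agrees with the empirical preference on the first triple and disagrees on the second.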

The exceptional cases are especially interesting in that estimation methods (such as Katz's back- 
off method) based just on the marginal frequencies, which the initial one-cluster model represents, 
would be consistently wrong. We see that the cluster model tremendously outperforms classic 
estimation methods in the exceptional cases, and thus has the potential to provide a much better 
solution to the sparse data problem. Furthermore, while some overtraining effects can be observed 
for the largest models considered, these effects do not appear for the exceptional cases. 



3.6 Related Work 



It is beyond the scope of this thesis to provide a review of the entire body of clustering literature; 
data clustering has been discussed in fields ranging from statistics to biology. One list of journals that 
publish papers on the subject contains 987 entries (Classification Society of North America, 1996); 
indeed, a summary of various clustering methods is a thesis in itself (Anderberg, 1973, "substantially 
this same text was submitted as a dissertation", pg. xiii). We therefore narrow our focus to two 
subjects: clustering methods appearing in the natural language-processing literature (section 3.6.1), 
and other probabilistic clustering methods (section 3.6.2). 



3.6.1 Clustering in Natural Language Processing 

Quite a few methods for distributional clustering have appeared in the literature of the natural 
language processing community, although to the best of our knowledge, our work is the first to use 
soft clustering in a language-processing context. The algorithms we will describe here 
fall into two categories: those that seek to find classes corresponding to human concepts, and those 
that create classes for the purpose of improving language modeling. 

As an aside, we note that these two categories correspond to two orthogonal trends in clustering 
work in general. The first trend, readily apparent in recent work on data mining and knowledge 
discovery, is to find clusters that are somehow well-formed. Work in this vein uses optimization 
criteria concerning cluster structure; for instance, our distortion function $\mathcal{D}$ measures the average 
distance between objects and centroids. The other trend is to find clusters that aid in the performance 
of some task; work in this area uses optimization criteria based on likelihood or some other 
performance measure. 



Clustering for Clusters' Sake 

Most of the methods whose end goal is the production of clusters (and therefore do not test whether 
the clusterings aid in the performance of some task) are geared towards finding either semantic or 
syntactic classes. The work of Hatzivassiloglou and McKeown (1993) (henceforth H&M) is notable 
because they provide a way to evaluate the goodness of semantic clusterings; many other papers 
(for example, Finch and Chater (1992) or Schütze (1993)) merely present example clusters and state 
that the derived classes seem to correspond to intuition. 

H&M describe a hard clustering scheme for grouping semantically-related adjectives. They treat 
adjectives as distributions over the nouns they modify, and use Kendall's $\tau$ coefficient (studied in 
section 2.3.4) to measure the distance between these distributions. Their optimization function is 
one of well-formedness: it rewards partitions that minimize the average distance between adjectives 
in the same cluster. They carefully delineate a rigorous evaluation method for comparing clusterings 
produced by their algorithm against clusterings produced by human judges,^1 computing precision, 
recall, fallout and F-measure results with respect to an average of the responses given by the judges, 
thereby taking into account the fact that humans do not always agree with each other. 

An interesting feature of their work is that they incorporate negative linguistic similarity infor- 
mation. By simply observing that adjectives in the same noun phrase should not, for a variety of 
linguistic reasons, be placed in the same class, they get dramatically better results (17-50% improve- 
ment across the various performance metrics). 

Some superficial similarities with our clustering work are readily apparent. The distributional 
similarity component of H&M's system treats adjectives as distributions over nouns, while we treat 
nouns as distributions over verbs. Also, we and H&M both used Associated Press newswire as 
training data, although H&M only used 8.2 million words, as opposed to our 44 million. However, 
our results are incomparable because our goals differ. H&M explicitly aim to create classes of 
semantically related words, and so must solicit human judgments. They were therefore constrained 
by human limitations to clustering only 21 adjectives. We, on the other hand, are more interested 
in clusterings that improve performance and so make use of a great deal more data. 

An independent body of work seeking to build classes corresponding to human intuitions is the 
field of language clustering. Many researchers in comparative lexicostatistics study the problem of 
how to create hierarchical clusterings that correspond to the evolution and splitting off of languages 
over time. Black and Kruskal (1997) give a short history and bibliography of the field. 



Clustering for Language Modeling 

A large number of papers have been written on using class-based models to improve language 
modeling (five such papers appear in the 1996 ICASSP proceedings alone ( pig, 1996| )). A common 
approach is to group words by their parts of speech. However, there is no reason to believe that 
classifications based on parts of speech are optimal with respect to language modeling performance, 
so we look at papers which present novel clustering techniques. Since the methods we discuss all 
attempt to create probabilistic models with strong predictive power, it is not surprising that they 
are all guided by the maximum likelihood principle. 

^1 One minor criticism: the number of clusters to create was a parameter given to the system, whereas the humans 
were free to choose whatever number of clusters they wished. 

The most well-known class-based method is the work by Brown et al. (1992). In their setting, 
the set of objects and the set of contexts are the same ($\mathcal{X} = \mathcal{Y} = \mathcal{W}$); the pair $(w_1, w_2)$ denotes 
the appearance of the two-word sequence $w_1 w_2$ in the training sample. Brown et al. assume a 
Boolean clustering of the data, so that each word w belongs only to the class $c(w)$, where $c(\cdot)$ is the 
membership function. Then, their class-based probability estimate takes the form 

$$P(w_2|w_1) = P(w_2|c(w_2))\,P(c(w_2)|c(w_1)). \tag{3.16}$$

Given the membership function, the parameters $P(w|c(w))$ and $P(c(w_2)|c(w_1))$ are determined by 
sample frequencies, so only the function $c(\cdot)$ needs to be estimated. This is done by attempting to 
find class assignments that maximize the average mutual information $\langle I \rangle$ of the clusters, which in 
the limit is equivalent to maximizing the likelihood: if $t_1 t_2 \cdots t_n$ is the training text, then 

$$L_c = \frac{1}{n-1} \log P(t_2 \ldots t_n | t_1) \approx -H(t) + \langle I \rangle,$$

where $H(t)$ is the entropy of the unigram (single word) distribution, which we can consider to be 
fixed. 
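
To illustrate how a fixed membership function determines the estimate (3.16) from sample frequencies, here is a minimal Python sketch. The toy bigram list, the class labels, and the name `class_bigram_model` are hypothetical; the sketch does not attempt the search for the membership function itself.

```python
from collections import Counter

def class_bigram_model(bigrams, cls):
    """Return prob(w2, w1) = P(w2|c(w2)) * P(c(w2)|c(w1)) from sample frequencies."""
    second_word = Counter(w2 for _, w2 in bigrams)
    second_class = Counter(cls[w2] for _, w2 in bigrams)
    first_class = Counter(cls[w1] for w1, _ in bigrams)
    class_pair = Counter((cls[w1], cls[w2]) for w1, w2 in bigrams)

    def prob(w2, w1):
        c1, c2 = cls[w1], cls[w2]
        if first_class[c1] == 0 or second_class[c2] == 0:
            return 0.0  # no evidence for this class in this position
        word_given_class = second_word[w2] / second_class[c2]
        class_given_class = class_pair[(c1, c2)] / first_class[c1]
        return word_given_class * class_given_class

    return prob

# Toy training sample (hypothetical) with a hard class assignment.
bigrams = [("the", "dog"), ("the", "cat"), ("a", "dog"),
           ("dog", "barks"), ("cat", "meows")]
cls = {"the": "DET", "a": "DET", "dog": "N", "cat": "N",
       "barks": "V", "meows": "V"}
prob = class_bigram_model(bigrams, cls)
print(prob("cat", "a"))  # the pair ("a", "cat") never occurs in the sample
```

The unseen pair ("a", "cat") receives nonzero probability because both words share classes with observed pairs, which is the generalization that class-based models are meant to provide.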

A serious problem Brown et al. face is that they do not have a way to calculate good estimates for 
$c(\cdot)$. Therefore, in each iteration step of their agglomerative clustering algorithm, they are forced to 
try many different merges of classes to find the one yielding the best improvement in $\langle I \rangle$. After some 
amount of care, they are able to derive an algorithm that takes $O(|\mathcal{W}|^3)$ time in each iteration step, 
whereas in the same setting our iteration steps would take $O(|\mathcal{C}||\mathcal{W}|^2)$ time, which is a significant 
savings if the number of clusters is small relative to the number of words. Also, once the desired 
number of clusters has been achieved, Brown et al. shift words from cluster to cluster in order 
to compensate for premature groupings of words in the same class - this is the rigidity problem 
referred to in the quotation from Kaufman and Rousseeuw earlier in this chapter. Since we create 
a soft clustering, we never have to compensate for words being incorrectly classed together. At any 
rate, Brown et al.'s method potentially involves much wasted computation since both good and bad 
merges and shifts must be tried, whereas we are guaranteed that each step we take reduces the free 
energy. 

Brown et al. do present an alternative algorithm which spends $O(|\mathcal{C}|^3)$ time in each iteration. 
This algorithm sorts the words by frequency and puts the top k into their own classes. Each iteration 
step consists of adding the next most frequent word yet to be clustered as a new class and then 
finding the best merge among the new set of classes; when this merge is taken, the system once 
again has k clusters. On the other hand, it is possible that this heuristic narrows the search down so 
much that good classings are missed. This may well explain the small perplexity reduction achieved 
by Brown et al.'s method on the Brown corpus (from 244 to 236 using a model that interpolates the 
class-based model with word-based estimators). 



Another commonly-cited class-based language-modeling method, that of Kneser and Ney (1993), 
is presented by Ueberla (1994). In many respects, Kneser and Ney's work is quite similar to that 
of Brown et al. The same probability model (3.16) is used, and some of the heuristics employed to 
speed up calculations are the same as well. However, their optimization criterion differs, although 
it, too, is derived via the maximum likelihood principle. Instead of an agglomerative clustering 
algorithm, Kneser and Ney start with the desired number of clusters, so that the only operation 
undertaken to improve the clustering is to move words from one cluster to another, searching for 
the move that makes the biggest improvement. The running time of each such iteration step is 

o{\w\-{\w\ + m). 

Ueberla reports that Kneser and Ney's method achieves perplexity improvements of up to 25% 
on Wall Street Journal data with respect to Katz's back-off method. This is a rather stunning 
result. However, the class-based model uses a smoothing method known as absolute discounting 
(Ney and Essen, 1993). An interesting question is how much of the performance is due to the 
smoothing method and how much is due to the clustering (Brown et al. did not smooth the data); 
no comparison was done between the class-based method and the absolute discounting method. 

3.6.2 Probabilistic Clustering Methods 

One of the first papers to discuss the notion of probabilistic clustering is that of Ruspini (1970), who 
was inspired by Zadeh's work on fuzzy sets (Zadeh, 1965). His method attempts to find membership 
probabilities (which he calls "degrees of belongingness") that optimize certain well-formedness 
conditions similar to our distortion function $\mathcal{D}$. However, he does not attempt to mathematically derive 
good estimates, so the search for good parameter settings consists of repeatedly altering one 
membership probability while keeping all the others fixed. Furthermore, his method relies on distances 
between objects, rather than between objects and average distributions. This poses no problem in 
his case because he only considers artificial problems where the true distances are known. In 
practice, however, estimates of inter-object distances can be quite sensitive to noise; centroid methods 
overcome this problem by averaging together many points. 



The fuzzy k-means method (Bezdek, 1981), a generalization of the k-means approach, bears some 
resemblance to our procedure. It is a centroid method using the Euclidean distance ($L_2$) as distance 
function. The centroid distributions depend on the squares of the membership probabilities: 

$$P(y|c) = \frac{\sum_x (P(c|x))^2 \, P_{MLE}(y|x)}{\sum_x (P(c|x))^2},$$


and the membership probabilities in turn depend on the positions of the centroids. The optimization 
function rewards clusterings that minimize the distance between objects and centroids: 

$$\sum_{x} \sum_{c} (P(c|x))^2 \, L_2^2\bigl(P_{MLE}(\cdot|x), P(\cdot|c)\bigr).$$



This is a well-formedness condition rather than a maximum likelihood criterion, and in fact fuzzy 
k-means is not meant to produce probability estimates. It is also not meant to produce a hierarchical 
clustering; the number of centroids is kept constant throughout the iterated estimation process. 

The clustering procedures most similar to our own are the deterministic annealing approaches; 
these include the work of Rose, Gurewitz, and Fox (1990) (which influenced our approach) and 
Hofmann and Buhmann (1997). These both find clusters that minimize the free energy (3.5). An 
important difference is that they use the squared Euclidean distance ($L_2$), whereas we use the KL 
divergence as distance function. In the distributional setting we have been considering, using the KL 
divergence is well-motivated, whereas it is not entirely clear why the $L_2$ norm would be meaningful. 

Bayesian methods (Wallace and Dowe, 1994; Cheeseman and Stutz, 1996) combine well-formedness 
constraints and performance criteria. They seek to find the model with the maximum 
posterior probability given the data, where the posterior probability is based on the product of the 
model prior and the likelihood the model assigns to the data. The prior is based on the structure of 
the cluster system, and in general encodes a bias for fewer clusters; such a prior serves to balance 
out the tendency of maximum likelihood criteria to reward systems that have a large number of 
clusters. This is analogous to our inclusion of a maximum entropy condition in the derivation of our 
method, since the maximum entropy criterion also tends to favor having fewer clusters. 



The methods of Wallace and Dowe (1994) and Cheeseman and Stutz (1996) do not yield cluster 
hierarchies because the number of clusters is allowed to fluctuate from iteration step to iteration 
step. The "class hierarchy" described by Hanson, Stutz, and Cheeseman (1991) does not consist of 
classes but rather of attributes: each node in the dendrogram represents a collection of parameter 
settings inherited by all the descendants of that node. 

3.7 Conclusions 

We have described a novel clustering procedure for probability distributions that can be used to 
group words according to their participation in particular grammatical relations with other words. 
Our method builds a hierarchy of probabilistic classes using an iterative algorithm reminiscent of 
EM. The resulting clusters are intuitively informative, and can be used to construct class-based word 
cooccurrence models with substantial predictive power. 

While the clusters derived by the proposed method seem in many cases semantically significant, 
this intuition needs to be grounded in a more rigorous assessment. In addition to evaluations 
of predictive power of the kind we have already carried out, it might be worthwhile to compare 
automatically-derived clusters with human judgements in a suitable experimental setting, perhaps 
the one suggested by Hatzivassiloglou and McKeown (1993). In general, however, the development of 
methods for directly measuring cluster quality is an open research area; the problem is compounded 
when one takes hierarchical clusterings into account. 

Another possible direction to take would be to move to other domains. For instance, document 
clustering has been studied by many researchers in the field of information retrieval. Recently, there 
has been renewed interest in using document clustering as a browsing aid rather than a search tool 
(see Cutting et al. (1992) for a short discussion), and also as a way to organize documents (Yahoo! 
(1997) provides a hierarchical clustering in which documents can appear in more than one class). 
In situations where $|\mathcal{X}|$ and $|\mathcal{Y}|$ are very large, our clustering algorithm may be somewhat slow; 
however, the extra descriptive power provided by our probabilistic clustering may well be worth the 
extra computational effort. 

Chapter 4 

Similarity-Based Estimation 



In the previous chapter we looked to cluster centroids as a source of information when data on 
a specific event $E$ was lacking. This chapter introduces an alternative model, where we instead 
look to the events most similar to $E$. For convenience, we will refer to the new type of model as 
similarity-based, although our clustering method of the preceding chapter also made use of the notion 
of similarity. 



4.1 Introduction 

In the previous chapter, we described a method for automatically clustering distributional data, 
and showed that we can use the clusters so derived to construct effective models for predicting 
probabilities in situations where data is lacking. The clustering method was divisive: the system 
started with just one cluster centroid, and as the temperature was slowly lowered, phase transitions 
caused cluster centroids to split. This splitting of centroids meant that the number of clusters k did 
not have to be determined beforehand; rather, all possible numbers of clusters could be considered 
in a fairly efficient fashion, and the best configuration could be chosen via cross-validation. For 
clustering algorithms that keep the number of clusters constant throughout the estimation process, 
the only way to try out many different numbers of clusters is to re-run the algorithm with a different 
value of k each time. But it is generally the case for these algorithms that the results of the 
computation for one k cannot be used to aid the computation for a different k, so the search for the 
right k is not very efficient. 

An interesting alternative is a nearest-neighbor approach, where given an event $E$ whose 
probability we need to estimate, we consult only those events that are most similar to $E$. In a sense, 
we allow each event to form the centroid of its own class, and thus avoid having to find the right 
number of clusters. While this approach does not reduce the size of the model parameter space as 
class-based approaches do, it avoids the over-generalization that class-based models can fall prey to. 
As Dagan, Marcus, and Markovitch (1995) argue, using class information to model specific events 
may lead to too much loss of information. Probabilistic clusterings ameliorate this problem somewhat 
by combining estimates from different classes, using membership probabilities to weight the 
class estimates appropriately, but the concern about over-generalization is still valid. 



We present a small example to make the over-generalization problem clearer. Figure 4.1 shows 
a situation in which there are four objects (the empty circles) and only one centroid (the grey circle 
in the middle). Let us assume we are trying to model the behavior of X. In a cluster-based centroid 
model, the estimate for the behavior of X depends on the behavior of the centroid; this dependence 
is indicated by the arrow from the centroid to X. However, the behavior of the centroid is an average 
of the behaviors of all the other points, including A, B, and C, as indicated by the arrows pointing 
to the centroid. Therefore, an estimate for X depends not only on A and C, which are relatively 
close to it, but also on B, which is much farther away. It might make more sense to use only points 
A and C in trying to estimate X. 

Figure 4.1: Centroid overgeneralization 



We therefore turn our attention in this and the next chapter to similarity-based language modeling 
techniques that do not require building general classes. While our cluster model, described 
in the previous chapter, estimated the conditional probability $P(y|x)$ of an object-context pair by 
averaging together class estimates $P(y|c)$, weighting the evidence of each class by the degree of 
association $P(c|x)$ between x and c: 

$$P(y|x) = \sum_{c \in \mathcal{C}} P(c|x) P(y|c),$$

our new object-centered model replaces the centroids by other objects: 

$$P(y|x) = \sum_{x' \in \mathcal{X}} f(x, x') P(y|x'), \tag{4.1}$$

where $f(x, x')$ depends on the similarity between x and x'. 

We are not the originators of equation (4.1). Similarity-based estimation was first used for 
language modeling in the cooccurrence smoothing method of Essen and Steinbiss (1992), derived from 
work on acoustic model smoothing by Sugawara et al. (1985). Karov and Edelman (1996) develop 
a similarity-based disambiguation method that also can be fit into the framework of equation (4.1); 
however, since their method does not estimate probabilities and relies on a similarity function that 
is calculated via an iterative process, we will not give further consideration to their work here. 

In this chapter we establish proof of concept: we discuss and compare ways to instantiate equation 
(4.1), using a simple decision task for evaluation purposes. The KL divergence will once again prove 
to be an effective measure of dissimilarity. In the next chapter we evaluate a similarity-based model 
on more true-to-life tasks that test the utility of our method for speech recognition; we use a more 
complicated version of the model presented here, incorporating several heuristics in order to speed 
up the computation. 



4.2 Chapter Overview 



As in the previous chapter, our goal is to estimate the (conditional) probability of object-context 
pairs $(x, y) \in \mathcal{X} \times \mathcal{Y}$. Our first concern in this chapter is to describe similarity-based estimation 
methods in general. In section 4.3 we develop a common framework for these methods, so that the 
only parameter that varies from method to method is the similarity function used. In the following 
section (4.4) we describe various similarity functions. The majority are based on distance functions 
studied in chapter 2, but we also discuss the confusion probability, which appears in the work of 
Essen and Steinbiss (1992). 

The second part of this chapter describes our evaluation of similarity-based methods. In section 
4.5.1, we introduce the problem of pseudo-word disambiguation, a task which is related to the usual 
word sense disambiguation problem, but presents many advantages in terms of ease of experimentation. 
After a discussion of the data used to construct basic language models and a comparison of 
these basic models (4.5.2), we look at a few examples to get a qualitative sense for how the different 
similarity functions perform (4.5.3). Finally, section 4.5.4 presents five-fold cross-validation results 
for the similarity-based methods and for several baseline models. Our tests show that indeed, 
similarity information can be quite useful in sparse data situations. In particular, we found that all 
the similarity-based methods performed almost 40% better than back-off if unigram frequency was 
eliminated from being a factor in the decision. 

An interesting phenomenon we observe is that the effect of removing extremely rare events 
from the training set is quite dramatic when similarity-based methods are used. We found that, 
contrary to a claim made by Katz that such events can be discarded without hurting language 
model performance (Katz, 1987), similarity-based smoothing methods suffer noticeable performance 
degradation when singletons (events that occur exactly once) are omitted. 

Throughout this chapter, the base of the logarithm function is 10. 



4.3 Distributional Similarity Models 

A similarity-based language model consists of three parts: a scheme for deciding when to use 
similarity-based information to determine the probability of a word pair, a method for combin- 
ing information from similar words, and, of course, a function measuring the similarity between 
words. We give the details of each of these three parts in the following three sections. 



4.3.1 Discounting and Redistribution 

We hold that it is best to always use the most specific information available. While the maximum 
likelihood estimate (MLE) 

$$P_{MLE}(y|x) = \frac{C(x, y)}{C(x)}$$

(equation (2.1) from chapter 2, where $C(z)$ is the number of times event z occurred in the training 
data) yields a terrible estimate in the case of an unseen word pair, it is pretty good when sufficient 
data exists. Therefore, Katz's (1987) implementation of the Good-Turing discounting method, 
described in chapter 2, provides an attractive framework for similarity-based methods; it uses the 
(discounted) MLE when the pair (x, y) occurs in the data, and a different estimate if the pair does 
not occur: 

$$\hat{P}(y|x) = \begin{cases} P_d(y|x) & \text{if } C(x, y) > 0 \\ \alpha(x) P_r(y|x) & \text{otherwise ($(x, y)$ is unseen).} \end{cases}$$

Recall that equation (2.3) actually represents a modification of Katz's formulation: we have written 
$P_r(y|x)$ where Katz has $P(y)$. This allows us to use similarity-based estimates for unseen word pairs, 
rather than simply backing off to the probability of the context y. Observe that this formulation 
means that we will use the similarity estimate for unseen word pairs only, as desired. 
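
The discount-and-redistribute scheme can be sketched in Python as follows. This is only an illustration under stated assumptions: a fixed absolute discount stands in for Good-Turing discounting to keep the example short, the normalizer alpha(x) is chosen so that the estimates sum to one, and the name `backoff_distribution` and the toy data are hypothetical.

```python
def backoff_distribution(counts, vocab, p_r, discount=0.5):
    """Return the estimate P_hat(.|x) for one object x as a dict over `vocab`.

    counts:   dict mapping each seen context y to C(x, y) >= 1
    p_r:      dict mapping every y in vocab to the redistribution estimate P_r(y|x)
    discount: ASSUMPTION: a fixed absolute discount replaces Good-Turing here
    """
    total = sum(counts.values())
    # Discounted estimate P_d(y|x), used only for seen pairs.
    p_d = {y: (c - discount) / total for y, c in counts.items()}
    # alpha(x) spreads the freed mass over unseen contexts in proportion to P_r.
    freed_mass = 1.0 - sum(p_d.values())
    unseen_mass = sum(p_r[y] for y in vocab if y not in counts)
    alpha = freed_mass / unseen_mass
    return {y: p_d[y] if y in counts else alpha * p_r[y] for y in vocab}

# Toy data (hypothetical): contexts observed with one object x.
vocab = ["eat", "cook", "drink", "read"]
counts = {"eat": 3, "cook": 1}
p_r = {"eat": 0.4, "cook": 0.2, "drink": 0.3, "read": 0.1}
p_hat = backoff_distribution(counts, vocab, p_r)
print(p_hat)
```

Seen pairs never consult `p_r`; only the unseen contexts "drink" and "read" receive redistributed mass, in the ratio that `p_r` prescribes.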

We next investigate estimates for Pr{y\x) derived by averaging information from objects that are 
distributionally similar to x. 



4.3.2 Combining Evidence 

The basic assumption of a similarity-based model is that if object x' is "similar" to object x, then 
the behavior of x' can yield information about the behavior of x. When data on x is lacking, then, 
we average together the distributions of similar objects, weighting the information furnished by a 
particular x' by the similarity between x' and x. 
More precisely, let $W(x, x')$ denote an increasing function of the similarity between x and x'; 
that is, the more similar x and x' are, the larger $W(x, x')$ is. Let $S(x)$ denote some set of objects 
that are most similar to x (we discuss the exact form of $S(x)$ in the next paragraph). Then, the 
general form of similarity model we consider is a W-weighted linear combination of predictions of 
similar objects: 

$$P_{SIM}(y|x) = \sum_{x' \in S(x)} \frac{W(x, x')}{\sum_{x' \in S(x)} W(x, x')} \, P(y|x'). \tag{4.2}$$

Observe that according to this formula, we predict that y is likely to occur with x if it tends to occur 
with objects that are very similar to x. 
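
A weighted average of this kind can be sketched in a few lines of Python. The weight table, the neighbor set, and the name `p_sim` are assumptions made for the example.

```python
def p_sim(y, x, similar, weight, cond_prob):
    """W-weighted average of P(y|x') over the objects x' in `similar`."""
    norm = sum(weight[(x, x2)] for x2 in similar)
    weighted = sum(weight[(x, x2)] * cond_prob[x2].get(y, 0.0) for x2 in similar)
    return weighted / norm

# Toy data (hypothetical): "wine" is judged more similar to "beer" than "water" is.
weight = {("beer", "wine"): 3.0, ("beer", "water"): 1.0}
cond_prob = {"wine": {"drink": 0.8, "pour": 0.2},
             "water": {"drink": 0.4, "pour": 0.6}}
similar = ["wine", "water"]
print(p_sim("drink", "beer", similar, weight, cond_prob))
```

Because the weights are normalized inside the sum, the estimates form a proper distribution over contexts whenever each neighbor's distribution does.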



Considerable latitude is allowed in defining the set $S(x)$. Essen and Steinbiss (1992) and Karov 
and Edelman (1996) (implicitly) set $S(x) = \mathcal{X}$. However, if $\mathcal{X}$ is very large, it is desirable to 
restrict $S(x)$ in some fashion, so that summing over all $x' \in S(x)$ is not too time-consuming. In the 
next chapter, we will consider various heuristics for choosing a small set of similar words. These 
heuristics include setting a limit on the maximum size of $S(x)$, and only allowing an object x' to 
belong to $S(x)$ if the dissimilarity between x and x' is less than some threshold value. We will show 
some evidence at the end of this chapter that limiting the size of the set of closest objects does not 
greatly degrade performance, at least for the best similarity-based models. 

The approach taken in this chapter is to use $P_{SIM}$ as the probability redistribution model in 
equation (2.3), i.e., $P_r(y|x) = P_{SIM}(y|x)$. In the next chapter we discuss a variation in which $P_r$ is 
a linear combination of $P_{SIM}$ and another estimator. 



4.4 Similarity Functions 



The final step in defining a similarity-based model is to choose what similarity function to use. 
We first look in section 4.4.1 at three functions from chapter 2 that measure the distance between 
distributions. For each of these functions, it is necessary to define a weight function $W(x, x')$ that 
"reverses the direction" of the distance function, since we need weights that have larger values when 
the distributions are less distant. Section 4.4.2 describes some of the properties of the confusion 
probability, which was used to achieve good performance results by Essen and Steinbiss (1992). 
Section 4.4.3 discusses the base language models from which object distributions are computed, and 
also summarizes some properties of the four similarity functions we will compare. 

Regardless of which similarity function is chosen, in order to make the computation of equation 
(4.2) efficient it is useful to compute the $|\mathcal{X}| \times |\mathcal{X}|$ matrix of similarities $W(x_i, x_j)$ or distances 
$d(x_i, x_j)$ (for arbitrary distance functions d) beforehand. 



4.4.1 Distance Functions 

In chapter 2 we studied several functions measuring the distance between probability distributions.
These included the KL divergence (section 2.3.1)

$$D(x \| x') = \sum_y P(y|x) \log \frac{P(y|x)}{P(y|x')},$$

the total divergence to the mean (section 2.3.2)

$$A(x,x') = D\left(x \,\Big\|\, \frac{x+x'}{2}\right) + D\left(x' \,\Big\|\, \frac{x+x'}{2}\right)$$

($(x+x')/2$ denotes the probability mass function $(P(\cdot|x) + P(\cdot|x'))/2$), and the $L_1$ norm (section
2.3.3)

$$L_1(x,x') = \sum_y |P(y|x) - P(y|x')|.$$

Since these functions are all distance functions, they decrease when the similarity between $x$ and $x'$
increases. However, we desire weight functions $W(x,x')$ that are increasing in the similarity between
$x$ and $x'$.

In the case of the KL divergence $D$, we set $W(x,x')$ to be

$$W_D(x,x') = 10^{-\beta D(x \| x')}.$$

$\beta$ is an experimentally-tuned parameter controlling the relative influence of the objects closest to $x$:
if $\beta$ is high, then $W(x,x')$ is non-negligible only for those $x'$ that are extremely close to $x$, whereas
if $\beta$ is low, objects that are somewhat distant from $x$ also contribute to the estimate. The choice of
a negative exponential form is motivated by the fact that the probability of drawing an i.i.d. sample
of size $n$ with empirical distribution $P$ from a multinomial $Q$ is $10^{-nD(P \| Q)}$ to first order in the
exponent; this is theorem 2.2 from section 2.3.1.



When the distance function is the total divergence to the mean, we also use a negative exponential:

$$W_A(x,x') = 10^{-\beta A(x,x')}.$$

Again, $\beta$ controls the relative importance of the most similar objects and is determined experimentally.

Finally, we define the weight function for the $L_1$ norm to be

$$W_{L_1}(x,x') = (2 - L_1(x,x'))^{\beta},$$

with $\beta$ playing the same role as in $W_D$ and $W_A$ above (we tried using the exponential form
$10^{-\beta L_1(x,x')}$, but $(2 - L_1(x,x'))^{\beta}$ yielded better performance results).

We have made no attempt to normalize these various weight functions, so they take on different
sets of values; for example, $W_D(x,x) = W_A(x,x) = 1$, but $W_{L_1}(x,x) = 2^{\beta}$. Normalization is not
necessary because our evaluation task ignores scale factors.
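As an illustration only, the three distance functions and their weight functions can be sketched in Python as follows. The representation of each distribution $P(\cdot|x)$ as a dictionary from contexts to probabilities, and all function names, are assumptions made for this sketch rather than part of the original implementation; base-10 logarithms are used to match the $10^{-\beta D}$ form above, and $\beta$ is assumed to be positive.

```python
import math

def kl_divergence(p, q):
    """D(p || q) with base-10 logs; infinite if q(y) = 0 where p(y) > 0."""
    total = 0.0
    for y, py in p.items():
        if py == 0.0:
            continue  # terms with p(y) = 0 contribute nothing
        qy = q.get(y, 0.0)
        if qy == 0.0:
            return math.inf
        total += py * math.log10(py / qy)
    return total

def total_divergence_to_mean(p, q):
    """A(p, q) = D(p || m) + D(q || m), where m = (p + q) / 2."""
    mean = {y: (p.get(y, 0.0) + q.get(y, 0.0)) / 2 for y in set(p) | set(q)}
    return kl_divergence(p, mean) + kl_divergence(q, mean)

def l1_norm(p, q):
    """L1(p, q) = sum over y of |p(y) - q(y)|."""
    return sum(abs(p.get(y, 0.0) - q.get(y, 0.0)) for y in set(p) | set(q))

def weight_kl(p, q, beta):
    """W_D = 10^(-beta * D(p || q))."""
    return 10.0 ** (-beta * kl_divergence(p, q))

def weight_a(p, q, beta):
    """W_A = 10^(-beta * A(p, q))."""
    return 10.0 ** (-beta * total_divergence_to_mean(p, q))

def weight_l1(p, q, beta):
    """W_L1 = (2 - L1(p, q))^beta."""
    return (2.0 - l1_norm(p, q)) ** beta
```

For two distributions with disjoint support, this sketch gives $L_1 = 2$ and $A = 2 \log 2$, the upper ends of their ranges, while the KL divergence is infinite and the corresponding weight is zero.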

4.4.2 Confusion probability 

Essen and Steinbiss (1992) introduced confusion probability in the context of cooccurrence smoothing
for language modeling. Cooccurrence smoothing was also applied by Grishman and Sterling (1993)
to the problem of estimating the likelihood of selectional patterns.

Of the four similarity-based models Essen and Steinbiss consider, we choose to describe and
implement model 2-B (equivalent to model 1-A) because it was found to be the best performer of
the four. Indeed, Essen and Steinbiss report test-set perplexity reductions of up to 14% on small
corpora. Although they used an interpolation framework, where the similarity-based estimate was
linearly interpolated with other estimators for seen as well as unseen events, we will for the sake of
uniformity incorporate the confusion probability into the back-off-like framework of equation (2.3).

The confusion probability represents the likelihood that object $x'$ can be substituted for object
$x$; it is based on the probability that $x$ and $x'$ are found in the same contexts:

$$P_C(x'|x) = \sum_y \frac{P(x|y)\,P(x'|y)\,P(y)}{P(x)} \tag{4.3}$$

(the term $P(x)$ is required to ensure that $\sum_{x'} P_C(x'|x) = 1$). Since this expression incorporates both
conditional probabilities and marginal probabilities, it is not a measure of the distance between two
distributions as are the functions described in section 2.3.

The confusion probability is symmetric in the sense that $P_C(x'|x)$ and $P_C(x|x')$ are identical up
to frequency normalization: $P_C(x'|x) / P_C(x|x') = P(x') / P(x)$. Unlike the measures described above, $x$ may not be
the "closest" object to itself, that is, there may exist an object $x'$ such that $P_C(x'|x) > P_C(x|x)$, as
we shall see in section 4.5.3.

Further insight into the behavior of $P_C$ is gained by using Bayes' rule to rewrite expression (4.3):

$$P_C(x'|x) = \sum_y \frac{P(y|x)}{P(y)}\,P(x',y).$$

This form reveals another important difference between the confusion probability and the functions
$D$, $A$, and $L_1$ described above. The latter three functions rate $x'$ as similar to $x$ if, roughly, $P(y|x')$
is high when $P(y|x)$ is. $P_C(x'|x)$, however, is greater for those $x'$ for which $P(x',y)$ is large when
$P(y|x)/P(y)$ is. Notice that the case when the ratio $P(y|x)/P(y)$ is large contradicts the back-off
assumption that $P(y)$ is a good estimate of $P(y|x)$ when the pair $(x,y)$ is unseen.

While the fact that $P_C$ is called a probability implies that it ranges between 0 and 1, some
elementary calculations show that in fact its maximum value is $\frac{1}{2} \max_y P(y)$. Following Essen and
Steinbiss, we choose the weight function $W(x,x')$ to be the confusion probability itself without
including the scale parameter $\beta$.
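The definition in (4.3) can be checked on a toy example. The sketch below is illustrative only: it assumes the base language model is given as a table of joint probabilities over (object, context) pairs, from which consistent marginals are derived, and the function name and toy data are hypothetical.

```python
from collections import defaultdict

def confusion_probability(joint, x, x_prime):
    """P_C(x'|x) = sum_y P(x|y) P(x'|y) P(y) / P(x).

    `joint` maps (object, context) pairs to probabilities summing to 1;
    marginals are computed from it so that they are mutually consistent.
    """
    p_x = defaultdict(float)
    p_y = defaultdict(float)
    for (obj, ctx), prob in joint.items():
        p_x[obj] += prob
        p_y[ctx] += prob
    total = 0.0
    for ctx, py in p_y.items():
        p_x_given_y = joint.get((x, ctx), 0.0) / py
        p_xp_given_y = joint.get((x_prime, ctx), 0.0) / py
        total += p_x_given_y * p_xp_given_y * py
    return total / p_x[x]

# Hypothetical counts, normalized to a joint distribution.
counts = {("guy", "see"): 2, ("guy", "play"): 1,
          ("role", "play"): 3, ("kid", "see"): 2}
n = sum(counts.values())
joint = {pair: c / n for pair, c in counts.items()}
```

On this toy table the values $P_C(x'|\text{guy})$ sum to 1 over $x'$, and the ratio $P_C(\text{kid}|\text{guy}) / P_C(\text{guy}|\text{kid})$ equals $P(\text{kid})/P(\text{guy})$, as the symmetry property above requires.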



4.4.3 Base Language Models 

Throughout the above discussion, we have blithely referred to the quantities $P(y|x)$, $P(x)$, and $P(y)$
without explaining where these quantities actually come from. These must be provided by some
base language model $P$, but it turns out that there is some subtlety as to the form the base language
model may take.

As discussed in section 2.3.1, the KL divergence $D(x \| x')$ is undefined if there exists a context $y$
such that $P(y|x)$ is greater than zero but $P(y|x')$ is zero. This argues for a language model that is
smoothed so that $P(y|x')$ cannot be zero. A natural choice is to use the back-off estimate, so that
$P(y|x) = P_{BO}(y|x)$, where $P_{BO}$ is given by equation (2.4).

However, the normalization of the confusion probability (4.3) requires that the base language 
model be consistent with respect to joint and marginal probabilities, that is, that 

$$P(x) = \sum_y P(y|x)\,P(x).$$

Unfortunately, the back-off estimate does not have this property, since it discounts conditional
probabilities without altering the marginals. Therefore, we use the maximum likelihood estimate as
the base language model for $P_C$: $P(y|x) = P_{MLE}(y|x)$.

Thus, we cannot directly compare the performances of all four of the similarity-based models
defined above because they require different base language models. In the experimental results
section of this chapter, then, we will evaluate the total divergence to the mean, the $L_1$ norm, and
the confusion probability, using $P_{MLE}$ as the base language model. Chapter 5 describes experiments
where the KL divergence is used as the distance function and the back-off estimate is used as the
base language model.



Several features of the measures of similarity listed above are summarized in table 4.1. "Base LM
constraints" are conditions that must be satisfied by the probability estimates of the base language
model. The last column indicates whether the weight $W(x,x')$ associated with each similarity
function depends on a parameter that needs to be tuned experimentally.



4.5 Experimental Results 

We evaluated three of the similarity measures described above on a word sense disambiguation task.
Each method is presented with a noun and two verbs, and must decide which verb is more likely to
have the noun as a direct object. Thus, we do not measure the absolute quality of the assignment
of probabilities, as would be the case in a standard language model evaluation such as perplexity
reduction (defined in the next chapter) but merely ask that a method be able to distinguish between
two alternatives. We are therefore able to ignore constant factors, and so need neither normalize
the similarity measures to lie between 0 and 1 nor calculate the denominator in equation (4.2).

distance    range                               base LM constraints                       tune?
$D$         $[0, \infty]$                       $P(y|x') \neq 0$ if $P(y|x) \neq 0$       yes
$A$         $[0, 2 \log 2]$                     none                                      yes
$L_1$       $[0, 2]$                            none                                      yes
$P_C$       $[0, \frac{1}{2} \max_y P(y)]$      Bayes consistency                         no

Table 4.1: Summary of similarity function properties



4.5.1 Pseudo-word Sense Disambiguation 

In the usual word sense disambiguation task, the method to be tested is presented with an ambiguous 
word in some context, and is asked to use the context to identify the correct sense of the word. For 
example, a test instance might be the sentence fragment "robbed the bank" ; the disambiguation 
method must decide whether "bank" refers to a river bank, a savings bank, or perhaps some other 
alternative. 

While sense disambiguation is clearly an important task, it presents numerous experimental
difficulties. First of all, the very notion of "sense" is not clearly defined; for instance, dictionaries
may provide sense distinctions that are too fine or too coarse for the data at hand. Also, one needs
to have training data for which the correct senses have been assigned, which can require considerable
human effort.

To circumvent these and other difficulties, we set up a pseudo-word disambiguation experiment
(Schütze, 1992; Gale, Church, and Yarowsky, 1992), the general format of which is as follows. We
first construct a list of pseudo-words, each of which is the combination of two different words in $\mathcal{Y}$.
Each word in $\mathcal{Y}$ contributes to exactly one pseudo-word. Then, we replace each $y$ in the test set
with its corresponding pseudo-word. For example, if we choose to create a pseudo-word out of the
words "make" and "take", we would change the test data like this:

make plans  =>  {make, take} plans
take action =>  {make, take} action

The method being tested must choose between the two words that make up the pseudo-word.
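The transformation just described can be sketched in a few lines of Python. This is a minimal illustration, assuming test data stored as (noun, verb) pairs; the function names and the tuple representation of a pseudo-word are choices made for this sketch only.

```python
def build_pseudo_words(verb_pairs):
    """Map each verb to the pseudo-word (v1, v2) that it belongs to."""
    mapping = {}
    for v1, v2 in verb_pairs:
        if v1 == v2:
            raise ValueError("a pseudo-word combines two different words")
        if v1 in mapping or v2 in mapping:
            raise ValueError("each word may contribute to only one pseudo-word")
        mapping[v1] = mapping[v2] = (v1, v2)
    return mapping

def transform_test_set(pairs, mapping):
    """Turn (noun, verb) pairs into (noun, pseudo_word, correct_verb) instances.

    The original verb is kept as the answer key, so no hand-tagging is needed.
    """
    return [(noun, mapping[verb], verb) for noun, verb in pairs]

mapping = build_pseudo_words([("make", "take")])
instances = transform_test_set([("plans", "make"), ("action", "take")], mapping)
```

Here `instances` pairs each noun with the pseudo-word {make, take} and records which of the two verbs actually occurred.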

The pseudo-word set-up has two attractive features. First, the alternative "senses" are under
the control of the experimenter. Each test instance presents exactly two alternatives to the
disambiguation method, and the alternatives can be chosen to be of the same frequency, the same part
of speech, and so on. Secondly, the pre-transformation data yields the correct answer, so that no
hand-tagging of the word senses is necessary. These advantages make pseudo-word experiments an
elegant and simple means to test the efficacy of different language models.



4.5.2 Data 



We ran our evaluation on the same Associated Press newswire data that we used for the clustering
evaluation described in the previous chapter. To review, we set $\mathcal{X}$ to be the 1000 most frequent
nouns in the data; $\mathcal{Y}$ was the set of transitive verbs $y$ that were observed to take a noun in $\mathcal{X}$ as
direct object. The extraction of object-verb pairs was performed via regular pattern matching and
concordancing tools (Yarowsky, 1992a) from 44 million words of 1988 Associated Press newswire,
which had been automatically tagged with parts of speech (Church, 1988). Admittedly, regular
expressions are inadequate for this task; although we filtered the results somewhat, some bad pairs
doubtless remained.

Training data for base language models
    with singletons:      587833
    without singletons:   505426

Parameter tuning and test data
    $T_1$:   3434   (tuning: 13718)
    $T_2$:   3434   (tuning: 13718)
    $T_3$:   3434   (tuning: 13718)
    $T_4$:   3434   (tuning: 13718)
    $T_5$:   3416   (tuning: 13736)

Table 4.2: Number of bigrams in the training, parameter tuning, and test sets.



We used 80%, or 587833, of the pairs so derived, for building base bigram language models,
reserving 20% for testing purposes. As some of the similarity measures to be compared require
smoothed language models, while others do not, we calculated both a Katz back-off language model
($P = P_{BO}$) and a maximum likelihood model ($P = P_{MLE}$). Furthermore, we wished to investigate
Katz's claim that one can delete singletons, word pairs that occur only once, from the training set
without affecting model performance (Katz, 1987); our training set contained 82407 singletons. We
therefore built four base language models, summarized in table 4.3.



            with singletons (587833 pairs)    omit singletons (505426 pairs)
MLE         MLE-1                             MLE-o1
back-off    BO-1                              BO-o1

Table 4.3: Base language models

Since we wished to test the effectiveness of using similarity information for unseen word
cooccurrences, we removed from the test set any object-verb pairs that occurred in the training set; this
resulted in 17152 unseen pairs (some occurred multiple times). The unseen pairs were further
divided into five equal-sized parts, $T_1$ through $T_5$, which formed the basis for five-fold cross-validation:
in each of the five runs, one of the $T_i$ was used as a performance test set, with the other four sets
combined into one set used for tuning parameters (if necessary) via a simple grid search. Finally,
test pseudo-words were created from pairs of verbs with similar frequencies, so as to control for word
frequency in the decision task. Our measure of performance was the error rate, defined as

$$\frac{1}{n}\left(\text{number of incorrect choices} + \frac{\text{number of ties}}{2}\right)$$

where $n$ was the size of the test corpus. A tie occurs when the two words making up a pseudo-word
are deemed equally likely.
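The error rate above can be computed as in the following sketch. The instance format (noun, (verb1, verb2), correct verb) and the `score` callback standing in for a model's estimate are assumptions of this illustration; the test set is assumed to be non-empty.

```python
def error_rate(instances, score):
    """Error rate with ties counted as half an error.

    instances: iterable of (noun, (v1, v2), correct_verb) tuples.
    score(noun, verb): the model's (possibly unnormalized) estimate for the pair.
    """
    instances = list(instances)
    incorrect = 0
    ties = 0
    for noun, (v1, v2), correct in instances:
        s1, s2 = score(noun, v1), score(noun, v2)
        if s1 == s2:
            ties += 1
        else:
            chosen = v1 if s1 > s2 else v2
            if chosen != correct:
                incorrect += 1
    return (incorrect + ties / 2) / len(instances)
```

A model that assigns the same score to every pair, as a maximum likelihood estimate does on unseen bigrams, ties on every instance and therefore receives an error rate of exactly 0.5 under this definition.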

We first look at the performance of the base language models themselves. Their error rates are
summarized in table 4.4. MLE-1 and MLE-o1 both have error rates of exactly .5 because the test
sets consist of unseen bigrams, which are assigned a probability of 0 by the maximum likelihood
estimate, so that every test instance is a tie. Since we chose to form pseudo-words out of verbs of
similar frequencies, the back-off models BO-1 and BO-o1 also perform poorly.

Since the back-off models consistently performed worse than the MLE models, we chose to use
only the MLE models in our subsequent experiments. Therefore, we only ran comparisons between
the measures that could utilize unsmoothed data, namely, the $L_1$ norm, the total divergence to
the mean, and the confusion probability. It should be noted, however, that on BO-1 data, the
KL divergence performed slightly better than the $L_1$ norm; in the next chapter, we will study the
performance of the KL divergence more carefully.

            $T_1$    $T_2$    $T_3$    $T_4$    $T_5$
MLE-1       .5       .5       .5       .5       .5
MLE-o1      .5       .5       .5       .5       .5
BO-1        0.517    0.520    0.512    0.513    0.516
BO-o1       0.517    0.520    0.512    0.513    0.516

Table 4.4: Base language model error rates

$L_1$                     $A$                       $P_C$
GUY       0.000000        GUY       0.000000        role      0.032925
kid       1.229067        kid       0.304297        people    0.024149
lot       1.354890        thing     0.329062        fire      0.013092
thing     1.394644        lot       0.330871        GUY       0.012744
man       1.459825        man       0.350695        man       0.011985
doctor    1.460766        mother    0.368966        year      0.009801
girl      1.479976        doctor    0.369644        lot       0.009477
rest      1.485358        friend    0.372563        today     0.009095
son       1.497497        boy       0.373881        way       0.008778
bit       1.497502        son       0.375474        part      0.008772
(role: rank 173)          (role: rank 43)           (kid: rank 80)

Table 4.5: 10 closest words to the word "guy" for $A$, $L_1$, and $P_C$, using MLE-1 as the base language
model. The rank of the words "role" and "kid" are also shown if they are not among the top ten.
4.5.3 Sample Closest Words 

In this section, we examine the closest words to a randomly selected noun, "guy", according to the
three measures $L_1$, $A$, and $P_C$.



Table 4.5 shows the ten closest words, in order, when the base language model is MLE-1. There
is some overlap between the closest words for $L_1$ and the closest words for $A$, but very little overlap
between the closest words for these measures and the closest words with respect to $P_C$: only the
words "man" and "lot" are common to all three. Also observe that the word "guy" itself is only
fourth on the list of words with the highest confusion probability with respect to "guy".

Let us examine the case of the nouns "kid" and "role" more closely. According to the similarity
functions $L_1$ and $A$, "kid" is the second closest word to "guy", and "role" is considered relatively
distant. In the $P_C$ case, however, "role" has the highest confusion probability with respect to "guy",
whereas "kid" has only the 80th highest confusion probability. What accounts for the difference
between $A$ and $L_1$ on the one hand and $P_C$ on the other?



Table 4.6, which gives the ten verbs most likely to occur with "guy", "kid", and "role", indicates
that both $L_1$ and $A$ rate words as similar if they tend to cooccur with the same words in $\mathcal{Y}$. Observe
that four of the ten most likely verbs to occur with "kid" are also very likely to occur with "guy",
whereas only the verb "play" commonly occurs with both "role" and "guy".



If we sort the verbs by decreasing $P(y|\text{"guy"})/P(y)$, a different order emerges (table 4.7): "play",
the most likely verb to cooccur with "role", is ranked higher than "get", the most likely verb to
cooccur with "kid", thus indicating why "role" has a higher confusion probability with respect to
"guy" than "kid" does.



Finally, we examine the effect of deleting singletons from the base language model. Table 4.8
shows the ten closest words, in order, when the base language model is MLE-o1. The relative order of
the four closest words remains the same; however, the next six words are quite different from those for
MLE-1. This data suggests that the effect of singletons on calculations of similarity is quite strong,
as is borne out by the experimental evaluations described in section 4.5.4. We conjecture that this
effect is due to the fact that there are many very low frequency verbs $y$ in the data. Omitting
singletons involving such words could then drastically alter the number of $y$'s that cooccur with
both $x$ and $x'$. Since our similarity functions depend on such words, it is perhaps not so surprising
that the effect on similarity values of deleting singletons is rather dramatic. In contrast, a back-off
language model is not as sensitive to missing singletons because of the Good-Turing discounting of
small counts and inflation of zero counts.

Object    Most Likely Verbs
guy       see get play let give catch tell do pick need
kid       get see take help want tell teach send give love
role      play take lead support assume star expand accept sing limit

Table 4.6: For each object $x$, the ten verbs $y$ with highest $P(y|x)$. Boldface verbs occur with both
the given noun and with "guy." The base language model is MLE-1.

(1) electrocute  (2) shortchange  (3) bedevil  (4) admire  (5) bore  (6) fool
(7) bless  ...  (26) play  ...  (49) get  ...

Table 4.7: Verbs with highest $P(y|\text{"guy"})/P(y)$ ratios. The numbers in parentheses are ranks.

4.5.4 Performance of Similarity-Based Methods 



Figure 4.2 shows the error rate results on the five test sets, using MLE-1 as the base language model.
The parameter $\beta$ was always set to the optimal value for the corresponding parameter training set.
RAND, which is shown for comparison purposes, simply chooses the weights $W(x,x')$ randomly.
$S(x)$ was set equal to $\mathcal{X}$ in all cases.

The similarity-based methods consistently outperform the MLE method (which, recall, always
had an error rate of .5) and Katz's back-off method (which always had an error rate of about .51)
by a huge margin; therefore, we conclude that similarity information is very useful for unseen word
pairs where unigram frequency is not informative. The similarity-based methods also do much better
than RAND, which indicates that it is not enough to simply combine information from other words
arbitrarily: it is quite important to take word similarity into account. In all cases, $A$ edged out
the other methods. The average improvement in using $A$ instead of $P_C$ is .0082; this difference is
significant to the .1 level ($p < .085$) according to the paired t-test.

$L_1$                     $A$                       $P_C$
GUY       0.000000        GUY       0.000000        role      0.050326
kid       1.174243        kid       0.300681        people    0.024545
lot       1.395178        thing     0.321719        fire      0.021434
thing     1.407363        lot       0.346137        GUY       0.017669
reason    1.416542        mother    0.364610        work      0.015519
break     1.424242        answer    0.366333        man       0.012445
ball      1.438618        reason    0.367112        lot       0.011255
answer    1.440296        doctor    0.373428        job       0.010992
tape      1.448657        boost     0.377174        thing     0.010919
rest      1.452688        ball      0.381274        reporter  0.010551

Table 4.8: 10 closest words to the word "guy" for $A$, $L_1$, and $P_C$, using MLE-o1 as the base language
model.



[Figure: "Error Rates on Test Sets, Base Language Model MLE-1"; plotted series: RAND, $P_C$, $L_1$, $A$, MLE, and back-off (BO-1).]

Figure 4.2: Error rates for each test set, where the base language model was MLE-1. The methods,
going from left to right, are RAND, $P_C$, $L_1$, and $A$, and the performances shown are for settings
of $\beta$ that were optimal for the corresponding training set. $\beta$ values for $L_1$ ranged from 4.0 to 4.5. $\beta$
values for $A$ ranged from 10 to 13.



The results for the MLE-o1 case are depicted in figure 4.3. Again, we see the similarity-based
methods achieving far lower error rates than the MLE, back-off, and RAND methods, and again,
$A$ always performed the best. However, omitting singletons amplified the disparity between $A$ and
$P_C$: the average difference in their error rates increases to .024, which is significant to the .01 level
(paired t-test).

An important observation is that all methods, including RAND, were much more effective if 
singletons were included in the base language model; thus, in the case of unseen word pairs, it is 
clear that singletons should not be ignored by similarity-based models. 

Recall that in these experiments we set $S(x) = \mathcal{X}$. From the point of view of computational
efficiency, it may not be desirable to sum over all the words in $\mathcal{X}$. We experimented with using only
the $k$ closest words to $x$, where $k$ varied from 100 to 1000 ($= |\mathcal{X}|$). We see from figure 4.4 that
stopping at $k = 600$ is sufficient to capture most of the performance improvement. It also appears
that $L_1$ and $A$ use the closest words more efficiently, as we could sum over 10 times fewer words
($k = 100$) at a performance penalty of less than 1%; stopping at $k = 100$ for $P_C$ would result in
increasing the error rate by 4%.

4.6 Conclusion 



Automatically-derived similarity-based language models provide an appealing approach for dealing 
with data sparseness. We suggest a framework which relies on maximum likelihood estimates when 
reliable statistics are available, and uses similarity-based estimates only in situations where data is 
lacking. 

We have described and compared the performance of four such models against two standard
estimation methods, the MLE method and Katz's back-off scheme, on a pseudo-word disambiguation
task. We observed that the similarity-based methods perform much better than the standard
methods on unseen word pairs, with the method based on the KL divergence to the mean being the
best overall.

[Figure: "Error Rates on Test Sets, Base Language Model MLE-o1"; plotted series: RAND, $P_C$, $L_1$, $A$, MLE, and back-off (BO-o1).]

Figure 4.3: Error rates for each test set, where the base language model was MLE-o1. The methods,
going from left to right, are RAND, $P_C$, $L_1$, and $A$, and the performances shown are for settings
of $\beta$ that were optimal for the corresponding training set. $\beta$ values for $L_1$ ranged from 6 to 11. $\beta$
values for $A$ ranged from 21 to 22.

We also investigated Katz's claim that one can build more compact language models without
suffering significant performance degradation by discarding singletons in the training data. Our
results indicate that for similarity-based language modeling, singletons are quite important; their
omission leads to noticeably higher error rates.

Chapter 5

Similarity-Based Estimation for Speech Recognition

The previous chapter looked at the performance of three similarity-based methods on a simple
disambiguation task. This chapter tackles the more realistic problems of perplexity reduction and
speech-recognition error reduction. The distance function used here is the KL divergence, and the
base language model is the back-off estimate. The similarity-based model considered in this chapter
is based on the model developed in chapter 4 but has several added features meant to improve both
performance and efficiency.



5.1 Introduction 

Chapter 4 introduced similarity-based methods, developed a general framework for them, and
compared several such methods on a pseudo-word disambiguation task. The pseudo-word task was
very convenient from an experimental point of view; it allowed us to limit the number of senses a
(pseudo-)word could have, as well as control the probabilities of the different senses (recall that we
chose to create pseudo-words out of verbs with similar frequencies). Thus, we were able to perform
a very clean experiment to demonstrate that indeed, similarity-based methods do have the potential
to outperform standard approaches to sparse data problems.

However, it must be admitted that pseudo-word disambiguation seems a bit distant from problems
encountered in real-world applications. Therefore, in this chapter we evaluate a similarity-based
method on two tasks: perplexity reduction and speech-recognition error rate.

Perplexity is often used as a performance metric for language modeling systems; it is generally
assumed that lowering the perplexity is correlated with better performance in practice (Jelinek,
Mercer, and Roukos, 1992). Let $P_{LM}$ be a probability model and $S$ some^ sample of text. Then the
perplexity $PP$ measures how well $P_{LM}$ models $S$:

$$PP = P_{LM}(S)^{-1/|S|}.$$

The intuition behind this expression is that a good language model should assign high probability
(and therefore low perplexity) to $S$, since $S$ was generated by the (unknown) source distribution
for the language. Another way to look at it is to regard the perplexity as measuring the average
branching of the text from the point of view of the language model. For example, suppose we have two
language models, $P_1$ and $P_2$. If it turns out that according to $P_1$, the only words that have a
high probability of occurring after the word "San" are "Juan" and "Jose", whereas according to
$P_2$, "Juan", "Jose", "cat", and "dog" all have a high probability of occurring after "San", then we
would say that $P_1$ is the better language model. It is more certain about which words can follow
"San"; one could say that it is less perplexed.

^Jelinek et al. note that the perplexity is a more accurate measure of the difficulty of recognition if the sample is
large.

In this chapter, we model the probabilities of pairs of adjacent words rather than object-verb
pairs; that is, $\mathcal{X} = \mathcal{Y}$, and the pair $(x,y)$ refers to the event that the two-word sequence, or bigram,
"x y" occurred in the training sample. We thus tackle the problem of bigram language modeling,
which is a special case of n-gram language modeling; n-grams are the dominant language-modeling
technology in speech recognition today. In a bigram language model, the probability of a string of
words is factored into a product of conditional word pair probabilities:

$$P_{LM}(w_1 w_2 \ldots w_n) = \prod_i P_{LM}(w_i|w_{i-1}).$$

Then, the perplexity of a bigram model $P_{LM}$ with respect to the string $w_1 w_2 \ldots w_n$ is

$$\left(\prod_i P_{LM}(w_i|w_{i-1})\right)^{-1/n} = \exp\left(-\frac{1}{n} \sum_i \log P_{LM}(w_i|w_{i-1})\right),$$

where base 10 logarithm and exponential functions are used throughout this chapter, as in chapter 
i . 
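As a small illustration of the perplexity formula, the following Python sketch evaluates a bigram model on a word sequence using base-10 logarithms and exponentials. The `cond_prob` callback is a hypothetical interface, and passing `None` as the history of the first word is an assumption of this sketch, not a convention taken from the text.

```python
import math

def bigram_perplexity(words, cond_prob):
    """Perplexity of w_1 ... w_n under a bigram model, in base 10.

    cond_prob(w, prev) returns P_LM(w | prev); prev is None for the first
    word (an assumption made for this sketch).
    """
    if not words:
        raise ValueError("the sample must contain at least one word")
    log_sum = 0.0
    prev = None
    for w in words:
        log_sum += math.log10(cond_prob(w, prev))
        prev = w
    return 10.0 ** (-log_sum / len(words))
```

A model that spreads its probability evenly over two possible continuations at every position has perplexity 2, while one that spreads it over four has perplexity 4, which matches the "average branching" reading of the "San" example above.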

Given our concern with the practicality of similarity-based estimation, we will also consider
several heuristics for improving the efficiency and performance of similarity-based models. In
particular, we will be interested in the effect of limiting the number of similar words that are consulted
in making an estimate for a particular bigram. Another heuristic we apply is to interpolate the
similarity information with the unigram (single word) probability used by Katz's back-off method.
We find that combining these two estimates does improve performance, although it is best not to
rely too much on the unigram probability (this is a gratifying result, as it tells us that the similarity
information is more important than the unigram information).

The rest of this chapter proceeds as follows. Section 5.2 explains the modifications we make to
the similarity-based model introduced in the previous chapter. Section 5.3 presents our evaluation
results: the new similarity model achieved a 20% reduction in perplexity with respect to Katz's
back-off model on unseen bigrams in Wall Street Journal data. These constituted just 10.6% of the
test sample, leading to an overall reduction in test-set perplexity of 2.4%. We also experimented with
an application of our language modeling technique to speech recognition, and found that it yielded
a statistically significant reduction in recognition error. Section 5.4 points out some directions for
further research.



5.2 The Similarity Model 

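Recall the general form for similarity-based models developed in chapter 4:

$$\hat{P}(y|x) = \begin{cases} P_d(y|x) & \text{if } C(x,y) > 0 \\ \alpha(x)\,P_r(y|x) & \text{otherwise ($(x,y)$ is unseen)} \end{cases} \tag{2.3}$$

We defined $P_r$ to be $P_{SIM}$, where

$$P_{SIM}(y|x) = \sum_{x' \in S(x)} \frac{W(x,x')}{\sum_{x' \in S(x)} W(x,x')}\,P(y|x'). \tag{4.2}$$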

Now, in the last chapter, we simply set $S(x)$, the set of objects most similar to $x$, to be equal to
the set $\mathcal{X}$. From a computational standpoint, though, this is somewhat unsatisfactory if $\mathcal{X}$ is large.
Furthermore, it might well be the case that only a few of the closest objects contribute to the sum
in (4.2). Therefore, we experiment in this chapter with limiting the size of $S(x)$. We now introduce
parameters $k$ and $t$, and define $S(w_1)$ to be the set of at most $k$ words $w_1'$ (excluding $w_1$ itself) that
satisfy $D(w_1 \| w_1') < t$. We need to tune $k$ and $t$ experimentally.

We will use the KL divergence as distance function in the experiments described below, since we
did not provide performance results for it in the previous chapter. Recall that the weight function
$W(x,x')$ for the KL divergence was defined to be

$$W(x,x') = 10^{-\beta D(x \| x')}.$$

Again, the parameter $\beta$ controls the relative contribution of words at different distances from $x$: as
$\beta$ increases, the nearest words to $x$ get relatively more weight. As $\beta$ decreases, remote words have
a larger effect on the sum (4.2). Like $k$ and $t$, $\beta$ is tuned experimentally.
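The construction of the neighbor set can be sketched as follows. This is an illustration under stated assumptions: the `distance` callback stands in for the KL divergence $D(w \| w')$, the function name is hypothetical, and the choice to keep the $k$ closest qualifying words (with ties broken alphabetically) is one natural reading of "at most $k$ words" rather than a detail confirmed by the text.

```python
def similar_words(w, vocabulary, distance, k, t):
    """Return at most k words w' != w with distance(w, w') < t, closest first.

    distance(w, other) is assumed to return D(w || other); ties in distance
    are broken alphabetically by the tuple sort.
    """
    candidates = [(distance(w, other), other) for other in vocabulary if other != w]
    within_threshold = sorted(c for c in candidates if c[0] < t)
    return [other for _, other in within_threshold[:k]]
```

Precomputing and storing these lists for every word keeps the cost of estimating an unseen bigram proportional to the number of neighbors actually consulted.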

While in the preceding chapter we set $P_r$ to be $P_{SIM}$, we shall see that it is better to smooth
$P_{SIM}$ by interpolating it with the unigram probability $P(y)$ (recall that Katz used $P(y)$ as $P_r(y|x)$).
Using linear interpolation we get

$$P_r(y|x) = \gamma P(y) + (1 - \gamma) P_{SIM}(y|x), \tag{5.1}$$

where $\gamma$ is an experimentally-determined interpolation parameter. This smoothing appears to
compensate for inaccuracies in $P_{SIM}(y|x)$, mainly for infrequent conditioning words. However, as the
evaluation below shows, good values for $\gamma$ are small, that is, the similarity-based model plays a
stronger role than the independence assumption.

To summarize, we construct a similarity-based model for $P(y|x)$ and then interpolate it with
$P(y)$. The interpolated model (5.1) is used as the probability redistribution model $P_r$ in (2.3)
to obtain better estimates for unseen bigrams. Four parameters, to be tuned experimentally, are
relevant for this process: $k$ and $t$, which determine the set of similar words to be considered, $\beta$,
which determines the relative effect of these words, and $\gamma$, which determines the overall importance
of the similarity-based model.
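Putting the pieces together, the redistribution model of equations (4.2) and (5.1) can be sketched as below. The callbacks `weight`, `cond_prob`, and `unigram` are hypothetical stand-ins for $W(x,x')$, the base language model $P(y|x')$, and $P(y)$. Returning zero for $P_{SIM}$ when no neighbor carries any weight, so that (5.1) reduces to $\gamma P(y)$, is a choice made for this sketch and not a behavior described in the text.

```python
def p_sim(y, x, neighbors, weight, cond_prob):
    """Equation (4.2): normalized, weighted average of P(y|x') over x' in S(x)."""
    weights = [(weight(x, xp), xp) for xp in neighbors]
    norm = sum(w for w, _ in weights)
    if norm == 0.0:
        return 0.0  # sketch-only fallback when S(x) is empty or all weights vanish
    return sum(w * cond_prob(y, xp) for w, xp in weights) / norm

def p_redistribution(y, x, neighbors, weight, cond_prob, unigram, gamma):
    """Equation (5.1): gamma * P(y) + (1 - gamma) * P_SIM(y|x)."""
    return gamma * unigram(y) + (1.0 - gamma) * p_sim(y, x, neighbors, weight, cond_prob)
```

With $\gamma = 0$ this reduces to the pure similarity-based model of the previous chapter, and with $\gamma = 1$ to the unigram redistribution used by Katz.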



5.3 Evaluation 

We evaluated our method by comparing its perplexity and effect on speech-recognition accuracy 
with the baseline bigram back-off model developed by MIT Lincoln Laboratories for the Wall Street 
Journal (WSJ) text and dictation corpora provided by ARPA's HLT program (Paul, 1991).^ The 
baseline back-off model closely follows the Katz design discussed in section 2.2, except that for 
the sake of compactness all singleton bigrams are treated as unseen (recall that this omission of 
singletons was quite detrimental to the simple similarity-based models considered in the previous 
chapter). The counts used in this model and in ours were obtained from 40.5 million words of WSJ 
text from the years 1987-89. 

For the perplexity evaluation, we tuned the similarity model parameters by minimizing perplexity 
via a simple grid search on an additional sample of 57.5 thousand words of WSJ text drawn from 
the ARPA HLT development test set. The best parameter values found were k = 60, t = 2.5, 
β = 4, and γ = 0.15. For these values, the improvement in perplexity for unseen bigrams in a 
held-out 18 thousand word sample, in which 10.6% of the bigrams are unseen, is just over 20%. This 
improvement on unseen bigrams corresponds to an overall test set perplexity improvement of 2.4% 
(from 237.4 to 231.7). Table 5.1 shows reductions in training and test perplexity, sorted by training 
reduction, for different choices of k. The values of β, γ and t are the best ones found for each k. 
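
The tuning procedure and the reported overall figure can be illustrated with a short sketch. The helper names (`perplexity`, `relative_reduction`, `grid_search`) and the toy objective are assumptions made for this example; in particular, `toy_objective` is only a stand-in for held-out perplexity and is not derived from the experiments.

```python
import itertools
import math

def perplexity(token_probs):
    """Perplexity 2 ** (-(1/N) * sum(log2 p_i)) for the N token probabilities."""
    n = len(token_probs)
    return 2.0 ** (-sum(math.log2(p) for p in token_probs) / n)

def relative_reduction(baseline, improved):
    """Fractional improvement of `improved` over `baseline`."""
    return (baseline - improved) / baseline

def grid_search(objective, grid):
    """Minimize objective(**params) over the Cartesian product of grid values."""
    names = list(grid)
    best_params, best_value = None, math.inf
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        value = objective(**params)
        if value < best_value:
            best_params, best_value = params, value
    return best_params, best_value

def toy_objective(beta, gamma):
    # Hypothetical stand-in for perplexity on the tuning sample.
    return (beta - 3.0) ** 2 + (gamma - 0.2) ** 2

best, _ = grid_search(
    toy_objective, {"beta": [2.0, 3.0, 4.0], "gamma": [0.1, 0.2, 0.3]}
)

# Overall test-set figures quoted in the text: 237.4 -> 231.7.
overall = relative_reduction(237.4, 231.7)
```

Here `overall` evaluates to roughly 0.024, matching the 2.4% improvement stated above.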



^The ARPA WSJ development corpora come in two versions, one with verbalized punctuation and the other 
without. We used the latter in all our experiments. 

k     t     β     γ      training reduction (%)   test reduction (%)
60    2.5   4     0.15   18.4                     20.51
50    2.5   4     0.15   18.38                    20.45
40    2.5   4     0.2    18.34                    20.03
30    2.5   4     0.25   18.33                    19.76
70    2.5   4     0.1    18.3                     20.53
80    2.5   4.5   0.1    18.25                    20.55
100   2.5   4.5   0.1    18.23                    20.54
90    2.5   4.5   0.1    18.23                    20.59
20    1.5   4     0.3    18.04                    18.7
10    1.5   3.5   0.3    16.64                    16.94

Table 5.1: Perplexity reduction on unseen bigrams for different model parameters 

From equation (4.2), it is clear that the computational cost of applying the similarity model to an 
unseen bigram is O(k). Therefore, lower values of k (and t as well) are computationally preferable. 
From the table, we can see that reducing k to 30 incurs a penalty of less than 1% in the perplexity 
improvement, so relatively low values of k appear to be sufficient to achieve most of the benefit of 
the similarity model. As the table also shows, the best value of γ increases as k decreases, that is, 
for lower k a greater weight is given to the conditioned word's frequency. This suggests that the 
predictive power of neighbors beyond the closest 30 or so can be modeled fairly well by the overall 
frequency of the conditioned word. 
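
The claim about k = 30 can be checked directly against the test-reduction column of Table 5.1. The snippet below is a small sanity check; the names `TEST_REDUCTION` and `penalty` are introduced here, and reading "penalty" as a difference in percentage points is an assumption.

```python
# k -> test-set perplexity reduction on unseen bigrams (%), from Table 5.1.
TEST_REDUCTION = {
    60: 20.51, 50: 20.45, 40: 20.03, 30: 19.76, 70: 20.53,
    80: 20.55, 100: 20.54, 90: 20.59, 20: 18.7, 10: 16.94,
}

def penalty(k, reference_k=60):
    """Loss in test reduction (percentage points) from using k, not reference_k."""
    return TEST_REDUCTION[reference_k] - TEST_REDUCTION[k]

# The k with the largest test reduction in the table.
best_k = max(TEST_REDUCTION, key=TEST_REDUCTION.get)

print(round(penalty(30), 2))          # relative to the tuned setting k = 60
print(round(penalty(30, best_k), 2))  # relative to the best row in the table
```

Both differences stay below one percentage point, whereas dropping to k = 20 or k = 10 costs noticeably more.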

The bigram similarity model was also tested as a language model in speech recognition. The test 
data for this experiment were pruned word lattices for 403 WSJ closed-vocabulary test sentences. 
Arc scores in those lattices are sums of an acoustic score (negative log likelihood) and a 
language-model score, in this case the negative log probability provided by the baseline bigram model. 

From the given lattices, we constructed new lattices in which the arc scores were modified to 
use the similarity model instead of the baseline model. We compared the best sentence hypothesis 
in each original lattice and in the modified one, and counted the word disagreements in which one 
of the hypotheses is correct. There were a total of 96 such disagreements. The similarity model 
was correct in 64 cases, and the back-off model in 32. This advantage for the similarity model is 
statistically significant at the 0.01 level. The overall reduction in error rate is small (from 21.4% to 
20.9%) because the number of disagreements is small compared with the overall number of errors in 
the recognition setup used in these experiments. 
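
The significance claim can be illustrated with an exact two-sided sign test on the 96 disagreements. The text does not say which test was used, so the sign test, and the function name `sign_test_p_value`, are assumptions of this sketch.

```python
from math import comb

def sign_test_p_value(wins, losses):
    """Exact two-sided binomial p-value under H0: each side wins with prob. 1/2."""
    n = wins + losses
    k = max(wins, losses)
    # Probability of an outcome at least as lopsided in one direction.
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)

# 64 disagreements favor the similarity model, 32 favor the back-off model.
p_value = sign_test_p_value(64, 32)
print(p_value < 0.01)
```

Under this test the 64-to-32 split yields a p-value well below 0.01, consistent with the significance level reported above.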

Table 5.2 shows some examples of speech recognition disagreements between the two models. 
The hypotheses are labeled 'B' for back-off and 'S' for similarity, and the bold-face words are errors. 
The similarity model seems to be better at modeling regularities such as semantic parallelism in 
lists and avoiding a past tense form after "to." On the other hand, the similarity model makes 
several mistakes in which a function word is inserted in a place where punctuation would be found 
in written text. 



5.4 Further Research 

The model presented in this chapter provides a modification of the scheme for similarity-based 
estimation described in the preceding chapter; several heuristics for improving speed and performance 
were incorporated. We have demonstrated that the augmented model can be of use in practical 
speech recognition systems. We now discuss some possible further directions to explore. 

It may be possible to simplify the current model parameters somewhat, especially with respect 
to the parameters t and k used to select the nearest neighbors of a word. On the other hand, it may 
be the case that using the same t and k for all words is too simplistic, although training a model in 
which t and k differ from word to word would involve massive sparse data problems. 

A more substantial variation would be to base the model on the similarity between conditioned 
words (y) rather than on the similarity between conditioning words (x). For example, Essen and 
Steinbiss's variation 1 considers the confusion probability (4.3) of contexts rather than objects (Essen 
and Steinbiss, 1992). However, they noted that model 1-A was equivalent to model 2-B (which we 
discussed in section 4.4.2; it uses the confusion probability of conditioning words), and that their 
other model using variation 1 did not perform as well. 

B   commitments . . . from leaders felt the three point six billion dollars
S   commitments . . . from leaders fell to three point six billion dollars

B   followed by France the US agreed in Italy
S   followed by France the US Greece . . . Italy

B   he whispers to made a
S   he whispers to an aide

B   the necessity for change exist
S   the necessity for change exists

B   without . . . additional reserves Centrust would have reported
S   without . . . additional reserves of Centrust would have reported

B   in the darkness past the church
S   in the darkness passed the church

Table 5.2: Speech recognition disagreements between models 

Other evidence may be combined with the similarity-based estimate. For instance, it may be 
advantageous to weight the similarity-based estimate by some measure of the reliability of the similarity 
function and of the neighbor distributions. A second possibility is to take into account negative 
evidence, as Hatzivassiloglou and McKeown (1993) did (see the discussion in section 3.6.1). For 
example, if x is frequent, but y never followed it, there may be enough statistical evidence to put 
an upper bound on the estimate of P(y|x). This may require an adjustment of the similarity-based 
estimate, possibly along the lines of the work of Rosenfeld and Huang (1992). 
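
One simple way such an upper bound might be obtained is sketched below. It is not taken from the text: it assumes that the n occurrences of x are independent trials, so that a true probability p would leave y unobserved with probability (1 - p)^n, and it caps the estimate at the largest p for which that probability is still at least α. The names `zero_count_upper_bound` and `cap_estimate` are hypothetical.

```python
def zero_count_upper_bound(n, alpha=0.05):
    """Largest p with (1 - p) ** n >= alpha, i.e. p = 1 - alpha ** (1 / n).

    n is the number of times x was observed without y following it.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    return 1.0 - alpha ** (1.0 / n)

def cap_estimate(p_estimate, n, alpha=0.05):
    """Clip a similarity-based estimate of P(y|x) at the zero-count bound."""
    return min(p_estimate, zero_count_upper_bound(n, alpha))

# If x occurred 1000 times and y never followed it, the bound is about 3/1000.
bound = zero_count_upper_bound(1000)
```

For frequent x the bound is tight enough to override an overly generous similarity-based estimate, while for rare x it is loose and leaves the estimate unchanged.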

Finally, the similarity-based model may be applied to configurations other than bigrams. For 
trigrams, it is necessary to measure similarity between different conditioning bigrams. This can 
be done directly, by measuring the distance between distributions of the form P(w_3 | w_1, w_2), 
corresponding to different bigrams (w_1, w_2). Alternatively, and more practically, it may be possible to 
define a similarity measure between trigrams as a function of the similarities between corresponding 
words in them. 



5.5 Conclusions 

Similarity-based models suggest an appealing approach to dealing with data sparseness. Based on 
corpus statistics, they provide analogies between words that often agree with our linguistic and 
domain intuitions. In the previous chapter we looked at the performance of various instantiations 
of a simple similarity-based model. In this chapter we presented a variant that provides noticeable 
improvement over Katz's back-off estimation method on realistic evaluation tasks. 

The improvement we achieved for a bigram model is statistically significant, although it is modest 
in its overall effect because of the small proportion of unseen events. While we have used bigrams as 
an easily accessible platform to develop and test the model, more substantial improvements might 
be obtainable for more informative configurations. An obvious case is that of trigrams, for which 
the sparse data problem is much more severe. For example, Doug Paul (personal communication) 
reports that for WSJ trigrams over a 20,000-word vocabulary, only 58.6% of the test set trigrams 
occurred in 40 million words of training data. 
Chapter 6 

Conclusion 



This paper is an absolute leviathan! Reverberating with history and personal recollection 
and occasionally exploding with well-aimed critical bursts, it sweeps you up like a great 
tidal wave and carries you along for over one hundred pages at an accelerating tempo, 
leaving you at the end with a sense that its driving energy has still not spent itself.... 



(Rosenkrantz, 1983, pg. viii) 



We have presented two ways to make use of distributional similarity for applications in natural 
language processing. The first was a distributional clustering method, which proved not only to 
create clusters that seem to correspond to intuitive sense distinctions but also to lead to a cluster-based 
language model with good predictive power. Our clustering method yields soft, hierarchical 
clusters. Our use of soft clustering in a language processing context appears to be novel, but is 
rather natural, since many words are ambiguous. 

We also presented a nearest-neighbor approach, where we combined estimates from similar words 
rather than from cluster centroids. This approach has the advantage of computational efficiency, 
since we do not need to engage in the iterative estimation necessary in our clustering work. We 
showed that methods based on the KL divergence provided substantial improvement over Katz's 
back-off method for unseen word pairs, and noticeable improvement over Essen and Steinbiss's 
confusion probability on a pseudo-word disambiguation task. To further demonstrate that similarity 
information can be helpful for applications, we also showed that an extension of our similarity-based 
model can produce both perplexity reduction and speech recognition error-rate reduction. 

We can only conclude that the incorporation of similarity information has the potential to provide 
better results in the area of language modeling. But our techniques may extend farther than that. 
Indeed, our clustering work certainly seems applicable to other problems such as automatic thesaurus 
construction or lexicon acquisition. The fact that the clusterings we produce are probabilistic may 
again be an advantage, since for instance words may appear in more than one thesaurus category. It 
would also be interesting to experiment with applying our techniques to the problems of document 
clustering and indexing, as mentioned at the end of chapter |[ 

This brings up a deeper question, though. What is the proper way to evaluate the inherent 
quality of clusterings (as opposed to measuring the performance gain clusterings can provide)? We 
need a good way to talk about how different one clustering is from another in order to analyze 
competing clustering methods. As we move to larger and larger data sets, it becomes more and 
more impractical to perform an evaluation such as that described by Hatzivassiloglou and McKeown 
(1993), where automatically-derived classes were compared to classes created by humans. Perhaps 
one fruitful direction, at least for hierarchical clusterings, would be to look at edit distances between 
trees (see, e.g., Kannan, Warnow, and Yooseph (1995)). 



Another key question to address is whether we can formulate adaptive versions of our algorithms. 
Since our methods do not rely on heavily annotated samples, it is easy to acquire new training data. 
What we would like is a way to incorporate new information without having to restart the clustering 
process (in the distributional clustering case) or recalculate the similarity matrix (in the nearest-neighbor 
case). For clustering, the fact that we use soft classes may once again provide the answer, 
since we can reestimate membership probabilities as new data comes in; if we built hard clusters, we 
would have to adjust for prematurely grouping two objects together or splitting two objects apart. 